Graph node rank based important keyword detection from Twitter

Social media networks like Twitter, Facebook and WhatsApp are among the most commonly used media for sharing news and opinions and for staying in touch with peers. Messages on Twitter are limited to 140 characters. This has led users to create their own novel syntax in tweets to express more in fewer words. Free writing style, use of URLs, markup syntax, inappropriate punctuation, ungrammatical structures, abbreviations etc. make it harder to mine useful information from them. For each tweet, we can get an explicit timestamp, the name of the user, the social network the user belongs to, or even the GPS coordinates if the tweet is created with a GPS-enabled mobile device. With these features, Twitter is, by nature, a good resource for detecting and analyzing real-time events happening around the world. By using the speed and coverage of Twitter, we can detect events, i.e. sequences of important keywords being talked about, in a timely manner, which can be used in applications like natural calamity relief support, earthquake relief support, product launches, suspicious activity detection etc. The keyword detection process from Twitter can be seen as a two-step process: detection of keywords in raw text form (words as posted by the users) and keyword normalization (reforming the users' unstructured words into complete, meaningful English words). In this paper a keyword detection technique based upon the graph, spanning tree and PageRank algorithm is proposed. A text normalization technique based upon a hybrid approach using Levenshtein distance, the demetaphone algorithm and dictionary mapping is proposed to work upon the unstructured keywords produced by the proposed keyword detector. The proposed normalization technique is validated using the standard LexNorm 1.2 dataset. The proposed system is used to detect keywords from Twitter text being posted in real time. The detected and normalized keywords are further validated against search engine results at a later time for detection of events.


Introduction
Social media services provide access to enormous amounts of data, but users' tendency to write in colloquial and abbreviated forms hampers its utility in Natural Language Processing (NLP), Information Retrieval (IR), data mining and Machine Translation (MT) applications. Recently, Twitter, a popular micro-blogging service, has become a new information channel for users to receive and exchange information. Every day, nearly 170 million tweets are created and redistributed by millions of active users. Twitter has several unique advantages that distinguish it from news websites, blogs, or other information channels. First, tweets are created in real time. With the brevity guaranteed by a 140-character message limit and the popularity of Twitter's mobile applications, users tweet and re-tweet instantly. For example, we could detect a tweet related to a shooting crime just 10 min after the shots were fired, while the news report would appear approximately 3 h later. Second, tweets have broad coverage of events. On Twitter, millions of general users, as well as verified accounts such as news agents, organizations and public figures, are constantly publishing new tweets. Every user can report news that is happening around him or her. Thus, tweets cover nearly every aspect of daily life, from national breaking news (e.g., earthquakes) and local events (e.g., car accidents) to personal feelings. Third, tweets are not isolated; they are associated with rich information by E-mail or by any other method. However, tweets are not posted in any standard format. This lack of standardization hampers NLP and MT tasks and renders a huge volume of social media data useless. Therefore, there is a need to convert such text into standard forms. This can be achieved by normalization, the process of converting ill-formed words into their canonical forms, which is a preprocessing step for any application that handles social media text.
Text normalization is challenging due to the colloquial nature of tweets. For example, repeated characters such as "gooood" (which can refer to god or good), phonetic errors (nite → night) and acronyms (ikr → I know, right) are some of the commonly seen traits of social media text.

Related work
Graph based keyword extraction techniques can be supervised or unsupervised, and context dependent or context independent. In this research work, many context-independent unsupervised graph based keyword extraction techniques have been explored. KeyWorld is an automatic indexing system proposed by Matsuo et al. [19] which extracts candidate keywords by measuring their influence on small-world properties. It captures characteristic path lengths and extended characteristic path lengths. This algorithm was inspired by the small-world phenomenon and the KeyGraph algorithm proposed by Ohsawa et al. [23]. Thereafter, Erkan et al. [9] proposed LexRank, which is insensitive to noise in text and calculates the importance of a sentence (or word) using eigenvector centrality. Mihalcea et al. [20] proposed the graph based TextRank model, which originated from the concept of PageRank. The author further improved TextRank for text summarization. In 2007, Palshikar [24] proposed a hybrid, statistics based approach for keyword extraction using a co-occurrence frequency measure. The author described eccentricity based keyword identification, keyword extraction based on other centrality measures, and proximity based keyword identification. Litvak et al. [18] proposed a HITS based algorithm for keyword extraction. In 2009, for event detection and tracking in social streams, Sayyadi [29] used the KeyGraph algorithm proposed earlier by Ohsawa et al. [23]. Later, in 2011, Litvak et al. introduced DegExt, a graph-based language-independent keyphrase extractor that uses degree centrality for keyword extraction. In 2013, Boudin et al. [34] compared various centrality measures for graph based keyphrase extraction from short documents. Abilhoa et al. [1] proposed the Twitter Keyword Graph (TKG) algorithm to extract keywords from Twitter data.
Normalization: Previous work used the noisy channel model as one text normalization technique. Brill and Moore [8] characterized the noisy channel model based on string edits for handling spelling errors. Toutanova and Moore [17] improved this model by embedding information regarding pronunciation. Choudhury et al. [22] proposed a supervised approach based on the Hidden Markov Model (HMM) for SMS text, considering graphemic/phonetic abbreviations and unintentional typos. Cook and Stevenson [25] expanded the error model by introducing probabilistic models for different erroneous forms according to a sampled error distribution. This work tackled three common types: stylistic variation, prefix clipping and subsequence abbreviations. Yang and Eisenstein [33] presented a unified log-linear unsupervised statistical model for text normalization using a maximum likelihood framework and a novel sequential Monte Carlo training algorithm.
Some of the previous work was based on the Statistical Machine Translation (SMT) approach to normalization. SMT deals with context-sensitive text by treating noisy forms as the source language and the standard form as the target language. Aw et al. [3] proposed an approach for Short Messaging Service (SMS) text normalization using phrase-level SMT and bootstrapped phrase alignment techniques. The main drawback of the SMT approach is that it needs a lot of training data and cannot accurately represent error types without contextual information.
Some researchers also treated the text normalization problem as a speech recognition problem. Kobus et al. [7] converted input tokens into phonetic forms and then applied phonetic dictionary lookup to restore them into words. Beaufort et al. [26] employed finite state methods for normalizing French SMS. Kaufmann and Kalita [16] proposed a machine translation approach for syntactic normalization (instead of lexical normalization). The literature also contains dictionary based normalization approaches. Saloot et al. [27] used a dictionary approach with OOV and standard form pairs as its entries. However, these approaches are highly dependent on dictionary size.
Normalization of social network text is a challenging task. Han and Baldwin [5] developed an approach that uses a classifier to identify non-standard words and then generates candidates based on morphophonemic similarity. Liu et al. [11] developed a technique for tweet normalization by introducing a character-level Conditional Random Field (CRF) sequence labeler on the edit sequences (computed for OOV words). Along with it, a unigram language model and phoneme and syllable features were taken into consideration.
Gouws et al. [30] developed an approach based on string and distributional similarity along with a dictionary look-up method to deal with ill-formed words. Han et al. [4] introduced a similar technique based on distributional similarity and string similarity, in which selection of correct forms was performed on a pair-wise basis. Hany [12] proposed an approach based on random walks on a contextual similarity bipartite graph constructed from n-gram sequences on a large unlabelled text corpus. Mohammad Arshi et al. [21] proposed a tweet normalization approach. First, candidates were generated by targeting lexical, phonemic and morphophonemic similarities. Then candidate selection was performed via three different probability scores (positional indexing, dependency-based frequency features and a language model).
More recent approaches handle text normalization using CRFs and neural networks. Min et al. [32] proposed a system in which Long Short-Term Memory (LSTM) recurrent neural networks, using character sequences and Part-Of-Speech (POS) tags, were used for predicting word-level edits. Leeman-Munk et al. [28] applied two feed-forward neural networks to predict the normalized token for an ill-formed word. Wagner and Foster [14] proposed a generalized perceptron method to generate word edit operations with character n-grams and recurrent neural network (RNN) language model hidden activation features. A character language model then selected the final normalization candidate. Yang and Kim [10] used a CRF based approach. The CRF used both Brown clusters and word embeddings trained using canonical correlation analysis as features.
Lochter [15] proposed an ensemble system to automatically detect opinions in SMS which combines text normalization and semantic indexing techniques. Almeida [31] developed a text processing approach based upon lexicographic and semantic dictionaries for semantic analysis and context detection. This technique can normalize terms as well as create new attributes so as to change and expand original text samples in order to improve performance (redundancies and inconsistencies).

ACI 17,2
Kim et al. [13] proposed a technique for correcting misspelled words in Twitter text using a character n-gram method to deal with spelling information and a word n-gram method to tackle the dependency of co-occurring words. Abiodun Modupe [2] developed a semi-supervised probabilistic approach for normalizing informal short text messages. Language model probability was used to enhance the relationships between formal and informal words. String similarity was then employed with a linear model to include features for both word-level transformations and local context similarity.

The proposed work
The architecture of the proposed system is given in Figure 1. It consists of two blocks. The function of the upper block is to extract the keywords in unstructured form (i.e. in the form of ill-formed words). The role of the lower block is to normalize these unstructured keywords into a standard, understandable form.

Keyword detection from Twitter
Real-time tweets are extracted from Twitter and used to create a directed weighted graph. The nodes of the graph are the words constituting the individual tweets, and there is a directed edge between two nodes if one node (word) immediately precedes the other in a tweet. The directed graph is given by $G = (V, E)$, where $V$ is the set of vertices and $E$ is the set of edges. Let $W$ be the set of all tweet words; then $V = W$. If there is a word sequence (tweet) of the form $T_i = (t_1, t_2, \ldots, t_n)$ where each $t_k \in V$, then there is a directed edge $(e_i, e_j) \in E$ if $e_i, e_j \in V$ and $e_i$ immediately precedes $e_j$ in some $T_i$. The weight of an edge is increased each time the edge is repeated in any tweet. Once the directed weighted graph is constructed, it is passed to the maximum spanning tree generator. The maximum spanning tree is generated as per Algorithm 1.
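The graph construction described above can be sketched as follows. This is a minimal illustration rather than the authors' Algorithm 1 input stage: the function name `build_tweet_graph`, the lowercasing, and the whitespace tokenization are assumptions made for the example.

```python
from collections import defaultdict


def build_tweet_graph(tweets):
    """Return {(u, v): weight} where u immediately precedes v in some tweet.

    Assumption: tokens are obtained by lowercasing and splitting on whitespace.
    """
    edge_weights = defaultdict(int)
    for tweet in tweets:
        words = tweet.lower().split()
        # Each consecutive word pair adds 1 to the weight of the edge u -> v.
        for u, v in zip(words, words[1:]):
            edge_weights[(u, v)] += 1
    return dict(edge_weights)


tweets = ["earthquake hits city", "earthquake hits coast", "city needs help"]
graph = build_tweet_graph(tweets)
print(graph[("earthquake", "hits")])  # edge repeated in two tweets
```

The edge weight here is simply a repetition count, so frequently repeated word pairs end up with the heaviest edges.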
The weighted directed graph is passed to Algorithm 1, which generates the maximum spanning tree using the concept of Kruskal's minimum spanning tree algorithm. The maximum spanning tree so generated can be viewed as a set of word sequences that are presumed to be the most talked about by users, as they have the maximum path lengths. The words contained in the maximum path may be considered the important keywords in unstructured form. The PageRank algorithm is used to assign an importance score to the keywords, and the keywords scoring above a predefined threshold are extracted as the most important keywords, which may further lead to event detection. The unstructured keywords are then transformed using the normalization process described below.
In the normalization step, the input is tokenized in order to extract strings. Text refining is applied to the extracted strings, which are then categorized into In-vocabulary (IV) and Out-of-vocabulary (OOV) lexicons. The Candidate Generation stage generates a list of possible correct words for an input OOV word. Finally, the Candidate Selection stage selects the best possible candidate from all generated candidates.
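A rough sketch of the two steps above, under stated assumptions, is given here. The source does not reproduce Algorithm 1, so this version adapts Kruskal's procedure by sorting edges in descending weight and ignoring edge direction during the cycle check (an assumption). The PageRank routine is a plain weighted power iteration; the damping factor 0.85, the 50 iterations, and the threshold 0.3 are illustrative values, not taken from the paper.

```python
def maximum_spanning_tree(edge_weights):
    """Kruskal-style selection: heaviest edges first, skip edges closing a cycle."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    tree = {}
    for (u, v), weight in sorted(edge_weights.items(), key=lambda item: -item[1]):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # endpoints not yet connected
            parent[root_u] = root_v
            tree[(u, v)] = weight
    return tree


def pagerank(edges, damping=0.85, iterations=50):
    """Weighted PageRank by power iteration; dangling mass is spread uniformly."""
    nodes = sorted({node for edge in edges for node in edge})
    out_weight = {node: 0.0 for node in nodes}
    for (u, _), weight in edges.items():
        out_weight[u] += weight
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        dangling = sum(rank[n] for n in nodes if out_weight[n] == 0)
        base = (1 - damping) / len(nodes) + damping * dangling / len(nodes)
        new_rank = {node: base for node in nodes}
        for (u, v), weight in edges.items():
            new_rank[v] += damping * rank[u] * weight / out_weight[u]
        rank = new_rank
    return rank


graph = {("quake", "hits"): 3, ("hits", "city"): 2, ("quake", "city"): 1}
tree = maximum_spanning_tree(graph)
rank = pagerank(tree)
threshold = 0.3  # hypothetical cut-off
keywords = [word for word, score in rank.items() if score > threshold]
print(tree, keywords)
```

On a disconnected graph the first function returns a spanning forest rather than a single tree, which seems a reasonable reading for tweet graphs with unrelated topics.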
The Token Qualifier segregates the input into two heaps: OOV and IV words. OOV tokens detected by the token qualifier are processed by the candidate generator, which generates possible normalized candidates via three different techniques: Levenshtein distance, the demetaphone algorithm and dictionary mappings. The candidate selector module works on the candidate list generated by the candidate generator and produces the best possible candidate for each OOV word. The token qualifier functions according to Algorithm 2.
The OOV words of the dataset (the output of Algorithm 2) act as input to Algorithm 3. According to research, correctly formed English words with repeating characters have a maximum of two character repetitions. Thus, a run of more than two repeated characters in a string is trimmed to two characters (helloooo → hello, gooood → good). Regular expressions are applied to OOV strings containing alphanumeric text. Some of the transformations, with examples, are: 4 → fore (B4 → bfore), 2 → to (2night → tonight), 9 → ine (F9 → fine) etc. After trimming and applying regular expressions, the OOV words to be processed further are obtained.
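The IV/OOV split performed by the token qualifier might look like the sketch below. Algorithm 2 itself is not reproduced in the source, so the function name `qualify_tokens` and the tiny `VOCABULARY` set are hypothetical stand-ins for a real English lexicon.

```python
# Hypothetical stand-in for a full English lexicon.
VOCABULARY = {"see", "you", "tonight", "at", "the", "party"}


def qualify_tokens(tokens, vocabulary=VOCABULARY):
    """Split tokens into in-vocabulary (IV) and out-of-vocabulary (OOV) lists."""
    iv_tokens, oov_tokens = [], []
    for token in tokens:
        if token.lower() in vocabulary:
            iv_tokens.append(token)
        else:
            oov_tokens.append(token)
    return iv_tokens, oov_tokens


iv, oov = qualify_tokens("see u 2nite at the party".split())
print(iv, oov)
```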
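The trimming and digit-substitution rules above can be sketched with Python's `re` module. The function names and the restriction to the three digit rules listed in the text are assumptions; the full rule set of Algorithm 3 is not given in the source. Note that trimming alone turns "helloooo" into "helloo"; reaching "hello" is presumably left to the later candidate generation stages.

```python
import re

# Digit rules quoted in the text: 4 -> fore, 2 -> to, 9 -> ine.
DIGIT_RULES = {"4": "fore", "2": "to", "9": "ine"}


def trim_repeats(token):
    """Collapse any run of three or more identical characters to two."""
    return re.sub(r"(.)\1{2,}", r"\1\1", token)


def expand_digits(token):
    """Apply the digit rules to mixed alphanumeric tokens only."""
    if token.isalpha() or token.isdigit():
        return token  # pure words and pure numbers are left untouched
    return re.sub(r"[249]", lambda match: DIGIT_RULES[match.group()], token)


def refine(token):
    """Lowercase, trim repeated characters, then expand digits."""
    return expand_digits(trim_repeats(token.lower()))


for raw in ["gooood", "helloooo", "B4", "2night", "F9", "2017"]:
    print(raw, "->", refine(raw))
```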

The first technique for generating candidates for an OOV word is Levenshtein distance (also known as edit distance). Edit distance is the number of insertions, deletions and substitutions needed to transform one string into another. It is used to handle spelling errors. An edit distance >2 results in a large number of candidates, most of which are inappropriate and at the same time costly to process. We therefore restrict the edit distance to ≤2. Algorithm 4 takes the output of Algorithm 3 as input and generates strings having edit distance ≤2 with respect to the input OOV word. In order to obtain a precise and limited candidate list, string similarity measures are applied to the candidate list generated via edit distance (≤2).
In order to handle phonetic errors (words that sound the same), the demetaphone algorithm is used. Words like nite and night sound alike. In order to obtain a limited, precise candidate set and to reduce processing complexity, string similarity measures are applied to the phonetic matches generated by the demetaphone algorithm (Algorithm 5). Nowadays, internet slang (lol → laughing out loud) and abbreviations (cuz → because) are common in social media text. Finally, therefore, we generate candidates using dictionary mapping (Algorithm 6). The output of Algorithm 6 is shown in Table 1.
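As an illustration of candidate generation by edit distance, the sketch below computes Levenshtein distance with the standard dynamic-programming recurrence and keeps vocabulary words within distance 2. Scanning a vocabulary (rather than enumerating all strings within two edits) and the sample word list are assumptions; Algorithm 4 is not shown in the source.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def edit_candidates(oov, vocabulary, max_distance=2):
    """Vocabulary words whose edit distance to the OOV token is <= max_distance."""
    return sorted(word for word in vocabulary
                  if levenshtein(oov, word) <= max_distance)


words = {"tomorrow", "tomato", "today"}  # hypothetical vocabulary
print(edit_candidates("tomoroe", words))
```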
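The dictionary mapping and the string-similarity filter mentioned above can be sketched as follows. The source does not name the similarity measure, so `difflib.SequenceMatcher` from the standard library is used here purely as an assumption, with a hypothetical threshold of 0.5. The phonetic candidates are assumed to come from a separate demetaphone implementation that is not shown, and `SLANG_MAP` holds only the mappings quoted in the paper.

```python
from difflib import SequenceMatcher

# Hypothetical slang dictionary built from examples quoted in the text.
SLANG_MAP = {
    "lol": ["laughing out loud"],
    "cuz": ["because"],
    "hw": ["homework"],
    "ttyl": ["talk to you later"],
}


def dictionary_candidates(oov, mapping=SLANG_MAP):
    """Look up an OOV token in the slang/abbreviation dictionary."""
    return list(mapping.get(oov.lower(), []))


def filter_by_similarity(oov, candidates, threshold=0.5):
    """Keep candidates whose similarity ratio to the OOV token meets the threshold."""
    return [candidate for candidate in candidates
            if SequenceMatcher(None, oov, candidate).ratio() >= threshold]


print(dictionary_candidates("Cuz"))
# The candidate list below is assumed to come from a phonetic matcher.
print(filter_by_similarity("nite", ["night", "knight", "banana"]))
```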

The candidate lists generated by all three techniques (the outputs of Algorithms 4, 5 and 6) act as input to the candidate scorer (Algorithm 7). Each candidate in a list corresponding to an OOV lexicon is assigned an equal probability. For candidates that are present in more than one list, an aggregate probability is calculated by summing their probabilities. This acts as the score. An aggregate list is then prepared by combining the candidate lists of all three candidate generation techniques.
The aggregate candidate list and score list prepared by Algorithm 7 act as input to Algorithm 8. For each OOV lexicon, the candidate with the maximum score in the score list is selected from the aggregate candidate list. If more than one candidate has the same score, part-of-speech (POS) tagging is applied and scores are assigned according to the importance of context: a noun is given the highest weight, followed by a verb and then an adjective. This returns a single best candidate for each incorrect word. The proposed modular approach, Algorithm 9, works on raw tweets. Preprocessing is done by removing unwanted strings (punctuation, hashtags etc.). The token qualifier is then called to detect OOV and IV words.
Rules are applied to the output of the token qualifier to generate the OOV tokens which will be used for further processing. Candidates are generated via Levenshtein distance (Algorithm 4), the demetaphone algorithm (Algorithm 5) and the dictionary approach (Algorithm 6). In order to select the best possible normalized word corresponding to an OOV word, the candidate scorer (Algorithm 7) and candidate selector (Algorithm 8) are employed. The words obtained after normalization may contain stop words, which are subsequently removed.
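One possible reading of the scoring step is sketched below: each candidate in a list receives probability 1/(list size), and a candidate appearing in several lists accumulates the sum. Algorithm 7 is not reproduced in the source, so this interpretation and the name `score_candidates` are assumptions.

```python
from collections import defaultdict


def score_candidates(candidate_lists):
    """Sum per-list uniform probabilities for every candidate across all lists."""
    scores = defaultdict(float)
    for candidates in candidate_lists:
        unique = list(dict.fromkeys(candidates))  # drop duplicates, keep order
        for candidate in unique:
            scores[candidate] += 1.0 / len(unique)
    return dict(scores)


edit_list = ["night", "note"]        # e.g. from edit distance
phonetic_list = ["night", "knight"]  # e.g. from phonetic matching
dictionary_list = []                 # no dictionary entry for this token
scores = score_candidates([edit_list, phonetic_list, dictionary_list])
print(scores)
```

Under this reading, a candidate proposed by several generators naturally outranks one proposed by a single generator.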
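The selection rule with POS-based tie-breaking described above could be sketched as follows. A real system would call a POS tagger; here a hypothetical lookup table `pos_of` stands in for it, and the numeric weights (noun 3, verb 2, adjective 1) are illustrative values that only preserve the ordering stated in the text.

```python
# Illustrative weights preserving the stated order: noun > verb > adjective.
POS_WEIGHT = {"NOUN": 3, "VERB": 2, "ADJ": 1}


def select_candidate(scores, pos_of):
    """Return the top-scoring candidate, breaking ties by POS weight."""
    if not scores:
        return None
    best_score = max(scores.values())
    tied = [cand for cand, score in scores.items() if score == best_score]
    if len(tied) == 1:
        return tied[0]
    # pos_of is a hypothetical {word: tag} table standing in for a POS tagger.
    return max(tied, key=lambda cand: POS_WEIGHT.get(pos_of.get(cand), 0))


pos_of = {"note": "VERB", "knight": "NOUN", "night": "NOUN"}
print(select_candidate({"night": 1.0, "note": 0.5, "knight": 0.5}, pos_of))
print(select_candidate({"note": 0.5, "knight": 0.5}, pos_of))
```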

Experimental set up and results
The proposed work is a combination of two parts: keyword detection and normalization. Both parts are implemented and validated.
The proposed normalization approach has been evaluated on the LexNorm 1.2 dataset, an updated version of the dataset for lexical normalization described in Han et al. [4]. This dataset contains English messages sampled from the Twitter API (from August to October 2010). The dataset is annotated considering one-to-one as well as one-to-many token mappings (like ttyl → talk to you later).
Processed OOV tokens (on which the normalization task is performed) are divided into three broad categories: letter, letter and number, and others. Letter refers to OOV tokens that contain only alphabetic text. Spelling errors (tomoroe → tomorrow), phonetic errors (u → you), stylistic variations (NOE → know) and clippings (cuz → because) are some of the traits of this category. Letter and number refers to alphanumeric text. It mostly contains phonetic errors (b4 → before). Others refers to internet slang (like lol → laughing out loud) and abbreviations (hw → homework).
96.2% of processed OOV tokens contain spelling errors, phonetic substitutions, stylistic text variations and prefix clippings. 2.04% of tokens are ill-formed exclusively due to phonetic errors, and the rest are short forms, mainly due to slang and abbreviations.
Normalization results are evaluated on the basis of Precision, Recall, F-score and BLEU score. Let $T_{dataset}$ be all tokens from the dataset and let $OOV_t$ be the list of all detected OOV tokens in the dataset ($\in T_{dataset}$). Let $gen_t^{oov}$ be the generated candidates for an $oov \in OOV_t$; $sel_t^{oov}$ be the best normalized candidate selected by the system for an OOV token $oov \in OOV_t$; $cor_t^{oov}$ be the tagged correction for an $oov \in OOV_t$; and $norm_t^{oov}$ be the set of OOV tokens $\in OOV_t$ normalized by the system.

$$\mathrm{Recall}(R) = \frac{\sum_{t \in T_{dataset}} \{\, sel_t^{oov} : sel_t^{oov} = cor_t^{oov},\ sel_t^{oov} \in gen_t^{oov},\ oov \in OOV_t \,\}}{\sum_{t \in T_{dataset}} \{\, oov : oov \in OOV_t \,\}} \qquad (6)$$
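A small sketch of how these metrics might be computed is given below. Recall follows equation (6), reading the braces as counts and collapsing the per-tweet sums into one dictionary of OOV tokens (a simplification). Only the recall equation is recoverable in the text, so precision (correct selections over tokens the system actually normalized) and F-score (harmonic mean) use the conventional definitions as an assumption. The function name and sample data are hypothetical.

```python
def evaluate(selected, gold):
    """Compute (precision, recall, f_score) for OOV normalization.

    selected: {oov: chosen normalization, or None if the system gave none}
    gold:     {oov: tagged correction}
    """
    correct = sum(1 for oov, correction in gold.items()
                  if selected.get(oov) == correction)
    normalized = sum(1 for oov in gold if selected.get(oov) is not None)
    precision = correct / normalized if normalized else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


gold = {"u": "you", "b4": "before", "nite": "night", "cuz": "because"}
selected = {"u": "you", "b4": "before", "nite": "note", "cuz": None}
precision, recall, f_score = evaluate(selected, gold)
print(round(precision, 3), round(recall, 3), round(f_score, 3))
```

When the system proposes a normalization for every OOV token, the two denominators coincide and precision equals recall.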

The BLEU (BiLingual Evaluation Understudy) score evaluates the accuracy of translation from one language to another. Translation done by a human is considered the gold standard. This standard is compared with the machine translated version and a score between 0 and 1 is assigned. If the machine translated file is exactly the same as the human translated file, a score of 1 is assigned, whereas a score of zero indicates that the two files are very different. We calculate the BLEU score between the normalized OOV tokens and their corresponding tagged corrections.
The proposed normalization approach has achieved a precision of 83.6%, recall of 83.6%, F-score of 83.6% and BLEU score of 91.1%. Experimental results are compared on three aspects. First, we compare the proposed modular approach with previous techniques which use LexNorm 1.2 as the dataset. Second, results of the proposed approach are compared with existing unsupervised and supervised approaches. Finally, intrinsic evaluation is explored for the individual techniques employed in the proposed modular approach. These comparative results are shown in Figures 2-4.
As shown in Figure 2, the modular approach outperforms the existing techniques which use the same dataset in terms of precision, recall and F-score. The text normalization approach of Yang and Eisenstein [33] has slightly lower results (performance of 82.06%), followed by Han et al. [4] with an overall performance of 75.3% and Liu et al. [11] with a recall of 66.81. Figure 3 shows that the proposed normalization approach yields better accuracy than existing unsupervised methods. The modular approach scores 1.54 percentage points higher than the log-linear model for unsupervised text normalization of Yang and Eisenstein [33]. Moreover, the unsupervised model for text normalization proposed by Cook and Stevenson [25] also has lower performance (57.9% accuracy) than the proposed approach (83.6%).
The comparative study with supervised methods in Figure 4 shows that the proposed modular approach, with a BLEU score of 91.1, performs better than the existing supervised text normalization techniques. The tweet normalization approach using maximum entropy of Mohammad Arshi et al. [21] achieved a BLEU score of 83.12, followed by the phrase-based statistical model for SMS text normalization (Aw et al. [3]) with a BLEU score of 80.7% and then the syntactic normalization of Twitter messages (Kaufmann and Kalita [16]), which achieved a BLEU score of 79.8%. Figures 2-4 validate the normalization part of the proposed work. The proposed keyword detection approach is used to detect the important keywords in unstructured form from the live feed of Twitter on 10-07-2017, for one thousand consecutive tweets after applying different Twitter filters. The results generated by the proposed keyword detection approach are shown in Table 2. As seen from the table, the unstructured keywords falling under OOV words generated by the proposed approach do not possess any meaning as posted but are important, and hence need normalization. The proposed normalization is applied to these OOV words and the result is shown in the last column of Table 2, along with other kinds of extracted keywords such as in-vocabulary words and proper nouns.
The normalized words so obtained were searched in combination with a particular filter, and the results obtained in the form of events are presented in Table 3. A careful inspection of Table 3 suggests that the search results obtained in response to the different normalized keywords correspond to actual events that happened in relation to the filter applied, which supports the proposed approach as a significant step toward an efficient event detection mechanism.
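For reference, the conventional definition of BLEU can be written as below. The source does not state which n-gram order or weights were used, so $N$ and $w_n$ are left generic here; $p_n$ denotes the modified n-gram precision, $c$ the candidate length, $r$ the reference length and $BP$ the brevity penalty.

```latex
\mathrm{BLEU} = BP \cdot \exp\!\left( \sum_{n=1}^{N} w_n \log p_n \right),
\qquad
BP =
\begin{cases}
1 & \text{if } c > r, \\
e^{\,1 - r/c} & \text{if } c \le r.
\end{cases}
```

With uniform weights $w_n = 1/N$ this is the geometric mean of the n-gram precisions scaled by the brevity penalty, which is why identical candidate and reference texts score 1.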

Conclusion
In this paper, a keyword detection technique based upon the directed graph, maximum spanning tree and PageRank algorithm is proposed. A text normalization technique based upon Levenshtein distance, the demetaphone algorithm and dictionary mapping is proposed to work upon the unstructured keywords produced by the proposed keyword detector. The proposed normalization technique is validated using the standard LexNorm 1.2 dataset. The proposed system is used to detect keywords from Twitter text being posted in real time. The detected and normalized keywords are further validated against search engine results at a later time for detection of events.