Automatic classification of older electronic texts into the Universal Decimal Classification – UDC

Purpose – The purpose of this study is to develop a model for automated classification of old digitised texts to the Universal Decimal Classification (UDC), using machine-learning methods.
Design/methodology/approach – The general research approach is inherent to design science research, in which the problem of UDC assignment of the old, digitised texts is addressed by developing a machine-learning classification model. A corpus of 70,000 scholarly texts, fully bibliographically processed by librarians, was used to train and test the model, which was used for classification of old texts on a corpus of 200,000 items. Human experts evaluated the performance of the model.
Findings – Results suggest that machine-learning models can correctly assign the UDC at some level for almost any scholarly text. Furthermore, the model can be recommended for the UDC assignment of older texts. Ten librarians corroborated this on 150 randomly selected texts.
Research limitations/implications – The main limitations of this study were unavailability of labelled older texts and the limited availability of librarians.
Practical implications – The classification model can provide a recommendation to the librarians during their classification work; furthermore, it can be implemented as an add-on to full-text search in the library databases.
Social implications – The proposed methodology supports librarians by recommending UDC classifiers, thus saving time in their daily work. By automatically classifying older texts, digital libraries can provide a better user experience by enabling structured searches. These contribute to making knowledge more widely available and useable.
Originality/value – These findings contribute to the field of automated classification of bibliographical information with the usage of full texts, especially in cases in which the texts are old, unstructured and in which archaic language and vocabulary are used.


Introduction
Written sources are a cornerstone of cultural heritage and provide evidence of human creativity, development and culture in specific times and spaces. They are kept and cared for at libraries.
With digitisation, books, periodicals, serials and other kinds of written sources have become more easily accessible to scientists and the public. National governments have invested much effort in the digitisation of complete libraries. In Europe, the digital library project Europeana (www.europeana.eu) works with thousands of European archives, libraries and museums to share cultural heritage for enjoyment, education and research. Europeana Collections provides access to over 57 million digitised items (books, music, artworks and more), which means that users can access all the available knowledge online through digital libraries by full-text or categorised search. While newer texts and articles are typically equipped with metadata, such as subjects, keywords and the Universal Decimal Classification (UDC), older texts are not. The amount of archive texts and articles available through digital libraries is enormous, and it cannot be expected that librarians could perform the task of UDC classification on their own. It is estimated that several hundred thousand texts, published in the 19th and 20th centuries, will not be manually processed, nor will librarians produce bibliographic records in the library catalogue for them. Because these types of sources will probably not be catalogued (in contrast to scholarly articles, for which a large set of metadata is available), it will be difficult or even impossible to offer filters and faceted navigation, because of the lack of available metadata, including classification such as UDC.
The problem addressed in this paper is how to assist the bibliographic processing of older digitised texts, which remains in the hands of human experts. These texts are currently mostly classified only to the general UDC group through the classification of the entire journal. The thesis of this paper is that machine-learning (ML) methods make it possible to automatically assign texts to, or propose for them, one or more appropriate UDC groups. Based on this thesis, we have developed two research questions:
RQ1. Can the UDC classification of the new scholarly texts, assigned by human experts, be used to build the UDC classification model?
RQ2. Can a classification model, built on scholarly texts, be used to classify older (unstructured) texts?
For this purpose, we have developed an ML classification model on the newer, correctly classified texts and used it for the classification of the old texts. This research aims to develop a methodology for automatic classification of electronic articles or texts into UDC. The structure of the paper is as follows: first, we start with the problem description and related work. We proceed with the methodology of the research, and the methods used for data collection and analysis. In the results section, we present the classification model, its validation and some use cases. Finally, we conclude with a discussion of the results and their implications for theory and practice.

Problem description
Digital libraries such as Europeana, Open Library (www.openlibrary.org), the Library of Congress (www.loc.gov/collections/) and others offer vast numbers of free digital resources on the web. In the Republic of Slovenia, one of the richest and most complete free electronic sources is the Digital Library of Slovenia (www.dlib.si), containing more than 850,000 electronic (digitised and digital) publications. Digitised sources are those that were originally only printed and later transformed into a digital format, whereas digital sources are those that are originally created and made accessible in digital format (and may also be printed). In the following text, we will use the term "digital publications" for all written sources available through the Digital Library of Slovenia. Usually, items included in a library would have a bibliographical record. Scholarly publications are systematically bibliographically processed, which means they have a bibliographical record in the library catalogue, and therefore one or more classification numbers from the UDC system. In contrast, most older texts and sources have not been bibliographically processed (e.g. articles and texts from older printed magazines and newspapers from the field of culture) and therefore are not classified into the UDC system. On the UDC Consortium's website, the classification is described as one of the first universal classification systems, and it remains one of the most widely used international classification systems in librarianship. It was developed at the end of the 19th century for the Universal Bibliographic Repertory, based on the Dewey Decimal Classification (DDC; 1876) by Melvil Dewey (Dale, 1978; Kendall, 2014). Other well-known systems are the Library of Congress Classification (LCC), Colon Classification (CC) and Bibliographic Classification (BC) (Miksa, 2017). The UDC classification is based primarily on the numerical ("decimal") labelling of the contents of articles with notations in the form of a sequence of numbers and symbols; it is universally portable and highly structured, and its notation is precisely representative of the subject content and understandable between languages. It is built so that it can be expanded and upgraded with new classifications. The classification is used worldwide and is currently translated into more than 50 languages. According to Slavic (2008), it is the second most used classification in the world. It allows unlimited assembly of classification attributes (e.g. master table, place, time, etc.) and relations between them to describe the subject (in our case) of the publication. Because publications in the union library catalogue (the COBIB union bibliographic/catalogue database) in Slovenia use the UDC classification scheme, we used this one. It contains nine basic groups: (1) Science and knowledge. Newspapers are usually classified as 070 "Newspapers. Printing. Journalism" or 050 "Serials. Periodicals".

The Universal Decimal Classification
On the website of the Digital Library of Slovenia, it is possible to search the content of the articles only through the full text. It is currently the best tool for discovering older texts. However, using and researching articles and other publications only in this way does not offer a good user experience, due to optical recognition deficiencies (poor quality of text recognition in newspapers and serials of the older type, use of old Slovene scripts like "metelčica", "dajnčica", "bohoričica", "gajica", etc.) and too many returned search results. For the majority of the texts and copies of serials, there is only one bibliographic record in the library catalogue. Examples of this include the newspaper "The Laibacher Zeitung", with more than 58,000 issues and many more articles, Ljubljanski zvon ("The Bell of Ljubljana"), with more than 11,000 articles, and Dom in svet ("The Home and World"), with over 16,000 articles. The easiest way to illustrate the present situation is the following example: all the articles of the serial "The Home and World" that originated between 1888 and 1944 are placed in the UDC classification sub-group Slovene Literature and Culture (821.163.6 and 008), which means that every article within the magazine is classified as "Slovene Literature" and "Culture" and nothing more than that. Some well-known magazines from those times illustrate the breadth of content. "The Home and World" was a Slovenian literary monthly, which was created as an entertaining and educational magazine for Roman Catholic readers and later developed into a literary magazine. It was founded by the philosopher and theologian Frančišek Lampe and edited by him until his death (1900). The magazine was initially distinctly Catholic but represented a more tolerant and, above all, the most artistically creative part of Catholic culture. "The Bell of Ljubljana", the central Slovenian literary newspaper, was published monthly. In addition to literature, it also contained art criticism as well as discussions and essays on the arts. At first, it was more scientifically oriented but later limited itself to the humanities, and from 1931 it also carried articles on current social issues. We can also mention "Agricultural and handicraft news"; it was first intended to help farmers and artisans, but later also carried articles in the fields of literature, conservative politics, culture and correspondence from various places. These publications were important mainly for the consolidation of the Slovene literary language, the general acceptance of the "gajica" and, in general, for the all-round cultural development of the Slovene nation. Gajica is a Latin script developed by the Croatian linguist Ljudevit Gaj. It was first used to write Croatian, but later, with some adaptations, it was also used to write Slovene.

Related work
Much research has been done on the classification of data in various fields. Data are everywhere, and their quantity is growing rapidly. Notably, the rise of data created on social media and the growth of online business transactions, which were expected to grow to 450 billion transactions a day by 2020 (Khatri, 2016), are contributing to the expansion of the digital universe.
Text mining, as a subfield of data mining, has become increasingly important in recent years due to the wide range of resources that generate huge amounts of data, such as social networks, blogs/forums, websites, emails and online libraries that publish research articles (Altınel and Ganiz, 2018). The main goal is to process and use the raw information in texts using ML algorithms (Bhushan and Danti, 2018). One of the problems encountered when analysing texts is that a text is usually in free, unstructured form, whereas machine-learning algorithms usually need structured input. Text mining thus refers to extracting interesting patterns, clusters and high-quality information from large text corpora using ML and statistical learning (Kaushik, 2013). Text classification using machine-learning techniques, as an important tool for managing vast amounts of texts, is described by Ikonomakis et al. (2005).
Text mining is gaining much attention in different scholarly fields, for example, in analysing user evaluation of tourism services, especially hotel and tourism services (Jimenez-Marquez et al., 2019). Similarly, in the field of medicine, with more than 27 million articles currently in the PubMed database, it is increasingly difficult for researchers and healthcare professionals to efficiently search, extract and synthesise knowledge from a variety of publications. Yi (2005) addresses the classification of bibliographic data from the library catalogue in MARC format and doctoral dissertation abstracts. The study seeks to achieve two goals: the use of the hidden Markov model to categorise the text, and the use of the Library of Congress Classification in conjunction with the Markov model and data mining. For the set of publications, the author used extracts from ProQuest's dissertation database. These are already categorised by librarians and offer an ideal set with which to test the model.
Similarly to our proposed research, the goal was to classify the texts, but the corpus consisted of texts fully described in the library catalogue, although in another type of classification (Library of Congress Classification (LCC), https://www.loc.gov/catdir/cpso/lcco/). There are, of course, other ways to classify texts, publications and articles. Erbs et al. (2013) present a hybrid approach to index term assignment that combines key phrase extraction with multi-label classification. It extends the automatic tagging of documents through multi-label classification, which assigns labels that are similar to tags or categories and by which documents can be clustered. Another study presents an approach to the assignment of the Library of Congress (LOC) subject headings with the usage of bibliographic records, DDC and abstracts of publications (Wartena and Franke-Maier, 2018).
The interest in the topic of scholarly text classification and recommendation has grown in recent years. Regarding the classification of scholarly texts according to the UDC (Romanov et al., 2016), texts are classified by peers based on their keywords. Similarly, bibliographic metadata (title, description and subject tags) can be used to equip texts with DDC to supplement bibliographic records of publications (Khoo et al., 2015). Also, for e-news, support can be developed for automatic inclusion into pre-defined groups, such as art, function, news, reviews, sports and world (Asy'arie and Pribadi, 2009; Ramdass and Seshasai, 2009). Karras and Mertzios (2002), using DDC, studied how to extract the meaning of vocabulary from sentences, going beyond categorising words and finding semantic connections.
The spread of digital resources and their integration into the traditional library environment has created the need for an automated tool that organises publications into library classification schemes. Yi (2007) asserts that automatic text classification is a research area concerned with developing tools, methods and models for use in this field. The author describes the currently popular approaches to text sorting and lists some projects in the area of classification, namely sorting publications in libraries, most notably with the LCC and DDC. From another perspective, considering the fast development of digital repositories and the growth of data and information, the use of semi-automatic metadata generation may be unavoidable in the future (Park and Brenza, 2015). Some techniques include meta-tag and content extraction, automatic indexing, text and data mining, extrinsic data auto-generation and social tagging, among others.
A survey of methods, such as content-based, collaborative filtering, graph-based and hybrid methods, can be found in the work of Bai et al. (2019). They identified the main approaches to recommender systems and commonly used performance evaluation metrics (Precision, Recall, F-measure, NDCG, MAP, MRR, MAE and UCOV). They outline the open questions that are relevant to all kinds of recommender systems (cold start, sparsity, scalability, privacy, serendipity) and unified scholarly data standards. An analysis of the use of recommendation-as-a-service for academia is presented in the study by Beel et al. (2017). Porcel et al. (2009) propose a model of a fuzzy linguistic recommender system to help university digital library users access research content and collaborate. All the presented recommender systems predominantly address the user side, whereas our proposed model is primarily aimed at helping librarians and, consequently, improving user experience.

The Universal Decimal Classification
There are several reasons for choosing the UDC system as the target classification system in our research. The primary reason is that in the COBIB database (https://plus.cobiss.si/opac7/help/cobib), the result of shared cataloguing by more than 680 Slovenian libraries participating in the COBISS.SI (Co-operative Online Bibliographic System and Services) system, all the newer material is already tagged with this classification system (UDC). Therefore, it is pragmatic to use the same classification system for equipping the older material as well. Second, UDC covers all fields of knowledge (Salah et al., 2012) and offers the following contributions, as suggested in the study by Colillas (2011):
(1) Classification codes (the identifier, number) can be used as a key to overcoming the problems experienced due to multi-lingualism (for example, the UDC number 61 has the same meaning, namely "medicine", in all languages).
(2) Harmonisation can help organise the entire architecture from a semantic perspective. Objects (in our case, records, articles, texts, as well as images, maps, etc.) are grouped by subjects, not by alphabetical relationships.
(3) Each new classification serves (like the others) as a code list and can be reused.

Methodology
The approach used in this research falls under Design Science Research (DSR) (Hevner et al., 2004), in which an IT artefact (an ML classification model) is developed to solve a real-life problem (classification of old texts). The process of model development both builds on prior knowledge and contributes to new knowledge. Efficiency, quality and usefulness must be demonstrated through evaluation, and detailed and verifiable results are provided, considering clear and rigorous methods (Kuechler and Vaishnavi, 2008).
The identification of the problem, which represents the first activity in the DSR methodology, and the objectives were defined in the introductory chapters, in which we described the problem and objectives of the research. Based on the research questions stated in the introduction section, we developed the following hypotheses:
H1. It is possible to build a classification model on the corpus of scholarly texts that would assign any randomly chosen publication from the test set (of scholarly texts) to at least one appropriate UDC number with the probability of at least 0.8.
We will test Hypothesis 1 using the performance measures commonly used in text classification: classification accuracy (CA), Recall, Precision and F1.
H2. The classification model built on a corpus of scholarly articles can assign at least one suitable UDC number to at least 80% of popular, unstructured, old texts.
We will test Hypothesis 2 with human experts (librarians) evaluating the performance of the classification model on 150 randomly selected classified texts.

Data preparation
At the core of the design cycle, as described by Hevner et al. (2004), is the classification model building process. The research process of data collection and the data analysis methods, which are presented in Figure 1, are described in detail as follows.
We exported the attributes described below for each text from the MSSQL database in which we store the data into a .json file for further processing. In total, we exported more than 70,000 scholarly articles and more than 200,000 old texts. In order to uniquely distinguish content and input attributes for model construction, the database included the "Title", "Full-text of article", "Identifier" and "UDC numbers". For the old texts, the UDC numbers were those defined for whole magazines, journals or newspapers, as already stated.
This phase can also be called the "pre-processing" of data. It is a process of cleaning the text (deleting non-alphanumeric characters, blank lines, etc.). We removed words that were not useful, such as stop words, so that only informative words entered the model.
The process of text classification model building starts with preparing a corpus of texts. In our case, we have two corpora. The first corpus consists of more than 70,000 scholarly texts that are all bibliographically processed, meaning that a human expert has assigned them one or more UDC numbers. The second corpus consists of 200,000 old and bibliographically unprocessed texts (articles, short notices, popular texts) that were mostly published between 1850 and 2000 in journals and newspapers in the Slovene language used at that time. Slovene is morphologically one of the more difficult/rich languages and, in its history, it has had several script variations (i.e. bohoričica, metelčica, dajnčica), which makes the old texts difficult to compare with the language of scholarly articles written in the present day. Consequently, we had to pay extra attention when cleaning the texts, using lemmatisation (reducing words to their base form) and replacing old vocabulary with that used today. We used two approaches in the cleaning process, as well as one pre-built model, FastText:
(1) Minimal processing (MP): we retained only words that were alphabetic (containing only alphabetical characters) and lemmatised them using a file that contains the root form (lemma) and all the words that share this root. If a word did not exist in the root dictionary, we left it in its original form.
(2) Regular processing (RP): similar to the above, but we also removed "stop words" and other, mostly irrelevant, words that we did not want to include in the list of words. We also removed all words shorter than three characters and those whose root or lemma was not in our dictionary.
(3) FastText (FT): the FastText model can be seen as a shallow neural network that derives its capabilities by scaling up the number of learnable vector embeddings of n-gram features that are fed into the network (Agibetov et al., 2018). It was developed at Facebook, and pre-trained models are available for more than 150 languages.
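The difference between minimal and regular processing can be illustrated with a minimal Python sketch. It is not the code used in the study: the lemma dictionary `LEMMAS`, the stop-word set `STOP_WORDS` and the function names are hypothetical placeholders, and the real lemma file is far larger.

```python
import string

# Hypothetical lemma dictionary (surface form -> lemma); illustrative entries only.
LEMMAS = {"knjige": "knjiga", "knjigo": "knjiga", "pisal": "pisati", "je": "biti"}
# Hypothetical stop-word list.
STOP_WORDS = {"in", "je", "na", "biti"}


def alphabetic_words(text):
    """Lower-case the text and keep only purely alphabetic words."""
    words = []
    for raw in text.lower().split():
        word = raw.strip(string.punctuation)
        if word.isalpha():
            words.append(word)
    return words


def minimal_processing(text):
    """MP: lemmatise where possible, otherwise keep the original word."""
    return [LEMMAS.get(word, word) for word in alphabetic_words(text)]


def regular_processing(text):
    """RP: as MP, but drop stop words, short words and words without a lemma."""
    kept = []
    for word in alphabetic_words(text):
        if len(word) < 3 or word not in LEMMAS:
            continue
        lemma = LEMMAS[word]
        if word in STOP_WORDS or lemma in STOP_WORDS:
            continue
        kept.append(lemma)
    return kept


sentence = "Pisal je knjigo, 1850 leta!"
print(minimal_processing(sentence))  # ['pisati', 'biti', 'knjiga', 'leta']
print(regular_processing(sentence))  # ['pisati', 'knjiga']
```

Under these assumptions, MP keeps the unknown word "leta" unchanged, whereas RP discards it together with the short stop word "je".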
The ML phase can be done with different programming languages, such as Python, R and similar. Like Jimenez-Marquez et al. (2019), we used the NLTK (Natural Language Toolkit) with Python, which offers easy-to-use interfaces and language resources, such as WordNet, along with a collection of text-processing libraries for classification, tokenisation, stemming, tagging, parsing, semantic reasoning and similar (Bird et al., 2009).

Figure 1. The process of automatic text classification
3.1.1 Process of clustering (unsupervised learning). First, we conducted a clustering analysis (Figure 1, before the structuring phase), using a k-means algorithm (Colavizza and Franceschet, 2016; Jain, 2010), to test whether the UDC classification of the scholarly texts adequately represents the natural groups identified within the texts. Since unsupervised learning works on unlabelled and uncategorised data, we could take this approach before structuring the data. The goal was to find useful insights from the data. For clustering, we drew on the complete data set of more than 70,000 scholarly texts, which is presented in detail in the following text.
Clustering is a type of unsupervised ML, in which an algorithm looks for similarities in a data set without a supervisor providing labels. As can be seen in the work of Romanov et al. (2016), the distribution of academic articles across the top-level UDC classes (0-9) in the library catalogue may vary. Some areas of the UDC are better represented in the articles than others: some categories contain a greater number of articles, while others are less represented. To avoid complications caused by class imbalance in model building and testing, we ensured an equal distribution of texts across all UDC groups at the basic level. We randomly picked 900 scholarly texts, specifically 100 texts for each main UDC class.
We used this set to test the distribution of clusters. We set the parameter k (the desired number of clusters) to 73, because the number of distinct UDC numbers in the set of 900 articles, when we shortened each UDC number to its second digit (e.g. 821.16 "Literature in Slavic languages" becomes 82 "Literature"), was equal to 73.
Unsupervised learning identifies common characteristics of data and connects members with the same characteristics into groups, thus creating clusters. As part of the research, we used this type of ML to verify the grouping of input data into sets. We checked whether the algorithms grouped related articles into the same groups. The results are presented in Section 4 and Figure 2.
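The choice of k described above can be sketched in a few lines of Python. This is only an illustration of the counting step, assuming each text carries a list of UDC strings; the helper names `udc_prefix` and `count_clusters` and the sample data are hypothetical.

```python
def udc_prefix(udc, digits=2):
    """Shorten a UDC number to its leading digits, e.g. '821.16' -> '82'."""
    only_digits = "".join(ch for ch in udc if ch.isdigit())
    return only_digits[:digits]


def count_clusters(udc_lists):
    """Count the distinct shortened UDC numbers across all texts."""
    return len({udc_prefix(udc) for udcs in udc_lists for udc in udcs})


# Hypothetical sample: each inner list holds the UDC numbers of one text.
sample = [["821.163.6", "008"], ["27"], ["821.16"], ["61", "27-1"]]
k = count_clusters(sample)
print(k)  # 4 distinct two-digit groups: 82, 00, 27 and 61
```

Applied to the 900 selected articles, the same counting is what yields k = 73 in the study; the resulting value would then be passed to the k-means algorithm as the desired number of clusters.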

Structuring
First, the text had to be "separated" into tokens or words. This process is called "tokenisation": a step that splits longer strings of text into smaller pieces, called "tokens". Larger pieces of text can be tokenised into sentences and sentences into words. Tokenisation is enabled in the Python programming language using the NLTK library and the word_tokenize function (Jimenez-Marquez et al., 2019; Ramdass and Seshasai, 2009). After that, we applied lemmatisation to obtain the dictionary form of words. For the old text corpus, we implemented another step: synonym replacement. Since the old corpora also use archaisms, we had to "translate" them into their current form. We used a dictionary with old words and lemmas to convert old words into their current forms. To make tokens useful for the classification, we had to convert them into numbers, which is the process of vectorisation. Similar to the work of Farkas et al. (2010) and Zhang et al. (2016), we used the TF-IDF method to model the matrix of word vectors appearing in texts.
When building feature vectors from texts, we did not use raw term frequency but term frequency weighted by inverse document frequency. According to Aggarwal and Zhai (2012), a common representation used for text processing is the vector-space based TF-IDF (term frequency-inverse document frequency) representation. In the TF-IDF representation, the term frequency for each word is normalised by the IDF. The IDF normalisation reduces the weight of terms that occur more frequently in the collection. This reduces the importance of common terms and ensures that the matching of texts is more influenced by more discriminative words that have relatively low frequencies in the collection. The result is sometimes called "the semantic vector" (Jalil et al., 2016). The result of vectorisation is an m × n array built over the dictionary of words appearing in the texts. After the vectorisation step, the vocabulary (an array with vectors representing the articles in n-dimensional space) was ready for the model implementation.
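To make the TF-IDF weighting concrete, the following Python sketch computes a small TF-IDF matrix by hand. It assumes the textbook variant (relative term frequency multiplied by log(N/df)); the paper does not state which variant or library settings were used, and library implementations often add smoothing. The function name `tfidf_matrix` and the toy documents are illustrative.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """Return (vocabulary, rows); rows[i][j] is the TF-IDF weight of term j in text i.

    Each text is a non-empty list of tokens.
    """
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: the number of texts in which each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n_docs / df[term]) for term in vocab}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        rows.append([counts[term] / len(doc) * idf[term] for term in vocab])
    return vocab, rows


# Toy, already tokenised and lemmatised texts.
docs = [
    ["cerkev", "vera", "bog"],
    ["pesem", "srce", "bog"],
    ["pesem", "noč", "veter"],
]
vocab, matrix = tfidf_matrix(docs)

# "cerkev" occurs in one text only, so it outweighs "bog", which occurs in two.
first = dict(zip(vocab, matrix[0]))
print(first["cerkev"] > first["bog"] > 0)  # True
```

Each row of `matrix` corresponds to one text and each column to one dictionary word, which matches the array structure described above.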

Model implementation
In this step, we had to perform three key tasks: splitting the data set into train and test sets, selecting an algorithm and fitting the model. For this purpose, we split the data set, consisting of more than 70,000 scholarly texts, into train and test subsets in the ratio of 80/20 (57,039 instances used for training the classifier, and 14,299 instances used for testing the classifier). After dividing the data set into learning and testing subsets, we used algorithms to build ML models. Each algorithm built its model on the training data set, which was later tested with test data.
The data set of old texts consisted of more than 200,000 articles. Validation of the performance of the classification model (Figure 1, fifth step) was done by human experts: 15 librarians who assessed the automatic UDC classifier on 150 randomly selected texts, since they were never classified by librarians.
ML algorithms are described as learning a target function (f) that best maps an input variable (X) to an output variable (Y): Y = f(X). The main goal of ML is to learn the mapping Y = f(X) in order to predict Y for a new input X. This can be called predictive modelling, for which the goal is to obtain the most accurate prediction possible. Fitting the model means making the algorithm learn the relationship between predictors and outcome so that prediction of future values of the outcome is possible.
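A shuffled 80/20 split of this kind can be sketched as follows. The function `train_test_split` below is a hypothetical helper written for illustration (not the library function of the same name), and the fixed seed is an assumption made only so that the example is reproducible.

```python
import random


def train_test_split(items, test_ratio=0.2, seed=42):
    """Shuffle a copy of the items and split it into (train, test)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


# Stand-in for the corpus: 100 text identifiers.
train, test = train_test_split(range(100))
print(len(train), len(test))  # 80 20
```

Shuffling before splitting is a common precaution so that the order of records in the export does not influence which texts end up in the test set.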

To assess the performance of the trained text classification model, for which the target variable can have two or more classes, the measures of Precision, Recall and F1 score are used. These measures are derived from the values of the confusion matrix (Abdelaziz et al., 2018; Joo et al., 2013).

Model verification
Normally, supervised learning techniques are used for automatic text classification, in which pre-defined category labels are assigned to articles based on the likelihood suggested by a training set of labelled documents (Baharudin et al., 2010). In supervised ML, our task was divided into two phases: first, training and testing on the 70,000 scholarly articles (for which the UDC was available in the bibliographic catalogue), and second, using the trained model to classify the 200,000 old non-scholarly articles. In the first phase, we tested on 20% of the scholarly texts (the test set), which amounted to 14,299 articles. We used classification accuracy (CA), Recall, Precision and F1 measures, which are defined as follows (Abdelaziz et al., 2018; Joo et al., 2013):
(1) Accuracy: the ratio of correctly classified cases to all cases of the observed set.
(2) Precision: the ratio of correctly classified positive cases to all positively classified cases of the observed set.
(3) Recall: the ratio of correctly classified positive cases to all positive cases of the observed set.
(4) F1: the harmonic mean of Precision and Recall.
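The measures listed above can be written out as a short Python sketch using the standard confusion-matrix definitions for a single class (true/false positives and negatives). The function name `binary_metrics` and the example counts are illustrative; for a multi-class target such as UDC, the values would typically be computed per class and then averaged, and the averaging scheme used in the study is not stated.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, Precision, Recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up counts for one class, for illustration only.
m = binary_metrics(tp=30, fp=10, fn=20, tn=40)
print(m["accuracy"], m["precision"], m["recall"], round(m["f1"], 3))
# 0.7 0.75 0.6 0.667
```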

Results and discussion
In the following, we report the results of the clustering analysis, classification model building and testing on newer scholarly data and use on the older texts.

Clustering of the scholarly articles data set
In the bibliographic catalogue, all the articles are classified with the usage of UDC; therefore, it was straightforward to check whether the naturally occurring clusters in the scholarly corpus are aligned with the assigned UDC class. Figure 2 shows articles in 73 different clusters (articles are represented by dots and clusters by colours; with UDC numbers abbreviated to their first two digits, there were 73 different UDC numbers in the 900 articles). Articles with similar content tend to be closer to one another and thus form clusters, which are graphically represented by dots in 3D space.
The zoomed-in detail on the right of Figure 2 shows one of the 73 clusters (groups). This is an example of a "clean" group, since its elements tend to stick together. In this group, the articles contain content about Christianity (UDC = 27). The most significant (TF-IDF) words from articles in this group are "Church, man, life, God, faith, council, God's, question, etc.". More than 90% of all articles/elements in this group have UDC = 27 (in the bibliographic catalogue). Another example of a very clean group is the one in which the articles have UDC 82 (literature). The most significant words calculated for this group are: "song, time, sky, eyes, heart, Earth, sun, wind, water, children, love, night, etc.". We used the clustering method to explore the input data and assess whether we can use this data to conduct the classification training. In so doing, we checked the relationship between publications and their UDC numbers on the one hand and the clusters produced by the k-means algorithm on the other. As can be seen in the example on the right side of Figure 2, clusters were homogeneous. However, each publication can have more than one UDC number and should therefore appear in multiple clusters (for example, there is a strong relation between religion and architecture, so an article on both should be in both clusters, but it is only in one). The disadvantage of using only unsupervised learning methods is that the system operates only with unlabelled data during learning and has no insight into the correctness of the results; its objective is to identify hidden structures in unlabelled data (Vakharia et al., 2015). The use of unsupervised learning methods on a small set of articles (900) served as an exploratory analysis.

Classification model building for scholarly texts
The complete corpus of more than 70,000 scholarly articles was divided into training and testing sets: 80% of the instances were used for training (57,039 texts) and the remaining 20% for testing (14,299 texts). Hypothesis 1 states that it is possible to build a classification model on the corpus of scholarly texts that assigns any randomly chosen publication from the test set to at least one appropriate UDC group with a probability of at least 0.8; it was tested using the classification accuracy (CA), Recall, Precision and F1 measures. The results of classification algorithm performance on the test set of scholarly texts are presented in Table 1. The results obtained for the classification of scholarly texts corroborate Hypothesis 1. As evident from Table 1, at least 80% of articles are accurately classified into an appropriate UDC group. The best performing classifier, according to classification accuracy, is SVM using TF-IDF (CA = 0.963). When the number of features (i.e. individual measurable characteristics of a subject being observed) is low, SVM, LogReg and MLP algorithms perform better than NB or KNN (Colas and Brazdil, 2006; Musa, 2013; Zanaty, 2012).
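For readers unfamiliar with the measures named above, the following minimal sketch shows how CA, Precision, Recall and F1 can be computed from true and predicted labels. It treats each text as having a single label and computes the per-class measures for one UDC group; this is a simplifying assumption, since the averaging scheme used in the study is not stated here. All data are hypothetical.

```python
def classification_accuracy(y_true, y_pred):
    """Share of texts whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_metrics(y_true, y_pred, positive):
    """Precision, recall and F1 for one UDC group treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical test-set labels (first digits of UDC) and predictions.
y_true = ["27", "27", "82", "82", "82", "61"]
y_pred = ["27", "82", "82", "82", "27", "61"]

ca = classification_accuracy(y_true, y_pred)
precision, recall, f1 = per_class_metrics(y_true, y_pred, positive="82")
print(f"CA={ca:.3f} precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```

Under Hypothesis 1, a model would be considered adequate when the resulting values reach at least 0.8 on the held-out test set.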

Using the classification model to classify old texts
Next, we tested Hypothesis 2, which states that the classification model trained on the corpus of scholarly texts can be used to classify the corpus of old texts. The old texts are not fully bibliographically processed (each text is merely assigned the UDC of its parent bibliographic record, i.e. the magazine or newspaper), are often poorly structured, are written in archaic Slovene and vary in length compared to the scholarly texts. The vocabulary of the 19th and early 20th centuries differs considerably from present-day language, because the vocabulary and language of every nation change over time. The complexity of the sentences and the choice of words also differ from those in academic literature. By reviewing an article, a librarian can give an opinion on the correctness of the classification given by a classification model.
Therefore, the main challenges of this task were that the language had changed significantly within one century and that the principles of writing scholarly texts differ considerably from those of writing popular texts. For this reason, we used a dictionary to translate old Slovene words (where possible) into the language used today, in order to make the classification algorithms work better. If we had had a large corpus of bibliographically processed old texts, we would have used it as a learning set and would probably have achieved even better results.
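The dictionary-based modernisation step can be sketched as a simple word-by-word lookup. The dictionary actually used is not described in this section, so the mapping below is hypothetical; its single entry is chosen for illustration from the two spellings of the title "Krajnski Vertnar" / "Krajnski Vrtnar" that appear in this paper.

```python
import re

# Hypothetical archaic-to-modern mapping (lower-case keys); not the authors' dictionary.
ARCHAIC_TO_MODERN = {"vertnar": "vrtnar"}


def modernise(text, mapping):
    """Replace each word found in `mapping`, preserving an initial capital letter."""

    def replace(match):
        word = match.group(0)
        modern = mapping.get(word.lower())
        if modern is None:
            return word  # words without a dictionary entry are left unchanged
        return modern.capitalize() if word[0].isupper() else modern

    # \w+ is Unicode-aware for str patterns, so letters such as č, š and ž are kept in words.
    return re.sub(r"\w+", replace, text)


modernised = modernise("Krajnski Vertnar", ARCHAIC_TO_MODERN)
print(modernised)
```

Leaving unknown words untouched mirrors the "where possible" qualification above: the lookup only helps for words that the dictionary covers.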
Once we had confirmed Hypothesis 1, we used the complete corpus of scholarly texts (more than 70,000) as a training set, following the basic idea that a classification model performs better when trained on a larger set. We then used the trained classification models to determine UDC numbers for the 200,000 old texts.
The algorithms placed each article in one or more UDC groups. For further analysis, we considered only the UDC numbers to which the classifier assigned a probability of at least 10% and sorted them by probability, for example, KNN: [("821", 0.59), ("7", 0.19), ("929", 0.12), ("111", 0.1)], where the first element of each pair is the UDC group and the second is the probability of correct placement in that group, in this example calculated by the KNN algorithm.
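The filtering and sorting step described above can be sketched as follows. The function name is hypothetical, and the input is assumed to be a mapping from UDC group to the probability returned by a classifier; the values reproduce the KNN example from the text.

```python
def top_udc(probabilities, threshold=0.10):
    """Keep UDC groups with probability >= threshold, sorted from most to least probable."""
    kept = [(udc, p) for udc, p in probabilities.items() if p >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Probabilities from the KNN example in the text, given in arbitrary order.
knn_probabilities = {"7": 0.19, "821": 0.59, "111": 0.1, "929": 0.12}

ranked = top_udc(knn_probabilities)
print("KNN:", ranked)
```

The threshold is inclusive here because the text speaks of "at least 10%"; any candidate below it is simply dropped before the list is shown to a librarian.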
A total of 150 randomly selected articles with automatically assigned UDC classes were evaluated by 15 librarians (each librarian evaluated 10 randomly assigned texts). For each text, the librarians had available the text itself and the results of automatic classification under three text processing and vectorisation approaches (minimal processing, regular processing and FastText) for each of five classifiers (naive Bayes, support vector machines, multilayer perceptron, logistic regression and k-nearest neighbours).
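The sampling and distribution of texts among evaluators can be sketched as below. The procedure actually used is not specified beyond the counts, so this is only one plausible reading: texts are drawn without replacement and no text is given to two librarians. The function name, the integer identifiers and the fixed seed are assumptions made for reproducibility of the example.

```python
import random


def assign_texts(text_ids, n_reviewers, per_reviewer, seed=0):
    """Randomly draw n_reviewers * per_reviewer distinct texts and split them evenly."""
    needed = n_reviewers * per_reviewer
    if len(text_ids) < needed:
        raise ValueError("not enough texts for the requested assignment")
    rng = random.Random(seed)  # local generator, so the global random state is untouched
    sample = rng.sample(text_ids, needed)
    return {
        reviewer: sample[reviewer * per_reviewer:(reviewer + 1) * per_reviewer]
        for reviewer in range(n_reviewers)
    }


corpus_ids = list(range(200_000))  # stand-in identifiers for the old texts
assignment = assign_texts(corpus_ids, n_reviewers=15, per_reviewer=10)
print(len(assignment), "librarians,", sum(len(v) for v in assignment.values()), "texts")
```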
To give a better understanding of the work of the human experts, we present a few examples below. Librarians labelled a calculated UDC number green if the UDC proposed by the classifier was appropriate and yellow if it was appropriate in a broader context. Proposed UDC numbers that were not labelled were not appropriate for the article being processed. An important difference between the UDC numbers assigned by librarians (for the entire journal) and those proposed for the articles reviewed is that the classifiers mostly "found" and "suggested" UDC numbers that describe the article by content and not only by type (e.g. 070 - Newspapers. The Press. Journalism; 050 - Serial publications, periodicals (as subject)). We explain this in more detail with three examples.
Electric robot (article, translation from Slovenian language): "In Czechoslovakia, a mechanical device was introduced, which automatically regulates the lighting and switch-off of the light. The robot consists of two cells. The first one is in the building of the city power plant, and the other on the transformer, as soon as it is in the evening, it reacts both cells to the change of light by burning everywhere electric light."
Since the article is very short, it seems important not to discard too many words from the corpus. In Tables 2 and 3, we display the results of automatic classifiers where minimal and regular processing was done. In Table 4, we display the results of an article that was processed with FastText.
UDC numbers for the entire publication (from the record in the library catalogue) are:
(1) 070 Magazines. Print. Journalism
(2) (497.4) Slovenia. Republic of Slovenia
(3) "18/19" 19th/20th century
Calculated/suggested UDC numbers accepted by the librarians:
(1) 007 Activity and organisation. Communication and control theory in general (cybernetics). "Human Engineering"
This UDC number gives us an average of p = 0.57 and seems more credible than probabilities below p = 0.2.
The classifiers perform surprisingly well on old Slovenian texts, as an example from "Kmetijske in rokodelske novice" (Agricultural and handicraft news) shows. The newspaper was published weekly by Janez Bleiweis. It was first intended to help farmers and craftsmen, and later published articles in the fields of fiction, conservative politics and culture, as well as letters from various places. It was important mainly because of the consolidation of the Slovene literary language, the general acceptance of the "gajica" alphabet and, in general, the comprehensive cultural development of the Slovenian nation. The newspaper later continued under the title "Novice kmetijskih, rokodelnih in narodskih reči".
For example, we took a very short text whose contents are shown in Figure 3.
Forecast of agricultural books (article/notice, translation from old Slovenian language): "Forecast of agricultural books, For sale in Ljubljana at the bookstore Mr Lerchar on the big square: Krajnski Vrtnar, or teaching to grow many fruit trees in a short time, to ennoble them by grafting, and plant beautiful gardens with great benefit. The Imperial Royal Society of Agriculture in Carniola was brought to light. Written by Franz Pirc, the pastor at Sv. Jernej in Lo ce. In Ljubljana 1834-1835. Price 24 crowns."
Tables 5-7 show the calculations of the classification results for the text "Krajnski Vertnar".
The green bar above the identifier of an article presents the count of appropriate UDC numbers confirmed by the human expert (librarian). The yellow bar presents the count of UDC numbers that are appropriate in a broader context, as stated by the librarians. As we can see in Figure 5, most of the articles had at least one UDC number approved by a librarian. It should be taken into account that not every librarian would necessarily choose the same UDC numbers for the same article. Even in real life, classification by librarians varies in the cataloguing process. Marijan and Leskovar (2015) showed that work involving human decision-making can vary from decision-maker to decision-maker. Librarians are independent in their work when determining UDC numbers in the process of cataloguing. For this reason, they reviewed and assigned UDC numbers for different sets of texts in the present study.
The evaluation of UDC numbers judged appropriate, or appropriate in a broader context, for the 150 articles gave the following results. For four texts, librarians accepted only UDC numbers approved in a broader context. The minimum number of approved UDC numbers per article was 0 (in four articles), and the maximum was 6 (in one article).
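Summary figures of this kind (averages, minimum, maximum and the number of texts approved only in a broader context) can be derived from per-text counts of green and yellow labels. The sketch below assumes each evaluation is recorded as a pair (green count, yellow count); the function name and the four sample pairs are hypothetical and do not reproduce the study's data.

```python
from statistics import mean


def summarise(evaluations):
    """Summarise per-text (green, yellow) counts of librarian-approved UDC numbers."""
    approved = [green for green, _ in evaluations]
    approved_or_broader = [green + yellow for green, yellow in evaluations]
    return {
        "mean_approved": mean(approved),
        "mean_approved_or_broader": mean(approved_or_broader),
        "min_approved": min(approved),
        "max_approved": max(approved),
        # texts with no green label but at least one yellow label
        "only_broader": sum(1 for green, yellow in evaluations if green == 0 and yellow > 0),
    }


# Hypothetical evaluations for four texts: (green count, yellow count).
evaluations = [(2, 1), (0, 2), (6, 0), (1, 0)]
summary = summarise(evaluations)
print(summary)
```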

Conclusion
We addressed the problem of automatic classification of old texts into UDC classes using a classification model trained on a corpus of newer scholarly texts. For this purpose, we prepared two corpora: the first of newer scholarly texts and the second of older texts. The newer texts were fully bibliographically processed, well-structured (written in an academic manner) and written in current language. We analysed this corpus using the k-means clustering method to confirm the alignment of naturally occurring groups with the UDC groups assigned by human experts (librarians). The corpus was then used for classification model building. The built model was used for classification of older texts, for which UDC labels existed only for the parent publication, meaning that all the texts were classified into one category regardless of their content. Therefore, our main goal was to assign categories to these texts automatically according to their content. We set out to test two hypotheses. The first hypothesis, stating that an efficient classifier trained on the newer scholarly texts can assign a correct UDC class in more than 80% of cases, was supported by the results (classification accuracy for the SVM algorithm was 0.9). The second hypothesis could not be tested using performance measures such as classification accuracy, as the data set was not labelled. We could only rely on the opinion of human experts, which is the main limitation of this study. We randomly selected 150 automatically classified texts and distributed them to 15 librarians (human experts). The task of each librarian was to evaluate the automatic classification for 10 assigned texts. The results, reported in Figure 5 and explained by examples in Tables 2-8, suggest that the classification models can be used for automatic classification of older texts. Among the 150 texts selected for evaluation, more than 90% were correctly classified into at least one UDC number.
Furthermore, all 150 texts were accurately classified at least in the broader scope of UDC. The set of automatically assigned UDC numbers confirmed by human experts was of such quality that the librarian was able to choose a replacement UDC number (instead of the one assigned for the entire journal) for any given article, at least in a broader context. Since the available group of articles was not as large as we would have liked (e.g. 200,000 scholarly articles evenly distributed across all UDC groups), some areas are much more represented than others, especially if we go in-depth with UDC classifiers. This is a clear limitation of this research. Nevertheless, the research presents a model that can be implemented for other areas and for related classification schemes or approaches. Only the main UDC table is used, because the learning set from which we built the classification models is too small in our corpus.
In practice, this means that classification models can support librarians in their daily work as a recommendation system for bibliographical processing. With the help of the research findings, it is possible to assist the cataloguing process and offer librarians UDC numbers they may have overlooked. As stated by other researchers (Beel et al., 2017; Porcel et al., 2009), there are several approaches by which librarians recommend (papers) to users, but our research can help the librarians themselves in the cataloguing phase. In addition, automatic classification contributes to a better experience for the end-user. Retrieval of the categorised old texts through digital libraries and web portals will offer new qualities in accessing the content, since the articles are equipped with new information and categorised by topic, category or subject. Thus, additional functionalities can be implemented, such as additional filtering (by topic), which consequently reduces the time required to search for the data.
This research contributes to the body of knowledge on bibliographic recommendation systems, particularly in the field of old digitised texts that are brought to the public through digital libraries. To the best of our knowledge, no such attempt has yet been described in the literature. By supporting both the end-user and the librarian in their work with this accumulated human knowledge, we hope that we serve society as well.
For the future, we plan to validate the classification models on a data set of bibliographically processed old texts. We also believe that equipping old texts with metadata (such as UDC numbers) can be further enhanced by implementing the wisdom of the crowds. As concluded by Nguyen et al. (2018), the goal is to obtain a sufficient amount of quality data. Existing systems for "mass outsourcing", or crowdsourcing, are usually based on one of three social network structures, as reported by Silvertown et al. (2015):
(1) The contributions of all participants have the same weight.
(2) A recognised expert connects and verifies data from the contributions of other users.
(3) The structure is based on the fact that no one can be an expert in identifying all taxonomic groups.Each person's contribution has a different weight, depending on the community's ability to contribute and feedback from the community.
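The difference between equal and contributor-specific weights (structures 1 and 3 above) can be illustrated with a small vote-aggregation sketch. This is not a description of any existing crowdsourcing system: the function, the contributor names and the weights are hypothetical, and expert verification (structure 2) is not modelled.

```python
from collections import defaultdict


def aggregate_votes(votes, weights=None):
    """Return the UDC number with the highest total vote weight.

    votes   -- list of (contributor, udc) pairs
    weights -- optional {contributor: weight}; every vote counts 1.0 if omitted
    """
    scores = defaultdict(float)
    for contributor, udc in votes:
        scores[udc] += 1.0 if weights is None else weights.get(contributor, 1.0)
    return max(scores, key=scores.get)


# Hypothetical votes on one old text.
votes = [("ana", "63"), ("bor", "63"), ("cene", "070")]

equal_weight_choice = aggregate_votes(votes)
weighted_choice = aggregate_votes(votes, weights={"cene": 3.0})
print(equal_weight_choice, weighted_choice)
```

With equal weights the majority label wins; once one contributor carries a higher weight, the outcome can change, which is why the choice of structure matters for the quality of the collected metadata.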
For less widespread languages, it is unlikely that a corpus of old texts large enough to support automatic classification with a quality above 99% will ever be bibliographically processed. Most likely, artificial intelligence, more precisely ML, will be the main player that helps us accomplish most of this work.

Across the 150 evaluated texts, librarians confirmed on average 1.8 approved UDC numbers per text; counting UDC numbers approved in a broader context as well, the average was 2.55 per text.