Digital Korean studies: recent advances and new frontiers

Purpose – This study aims to reflect on the past and prospects of digital Korean studies. Design/methodology/approach – Discussion includes the remarkably early adoption of computing in the Korean humanities, the astounding pace at which Korean heritage materials have been digitized, and the challenges of balancing artisanal and laboratory approaches to digital research. Findings – The main takeaway is to reconsider the widespread tendency in the digital humanities to privilege frequentist analysis and macro-level perspectives. Practical implications – Cha hopes to discover the future of digital Korean studies in semantic networks, graph databases and anthropological inquiries. Originality/value – Cha reconsiders existing tendencies in the digital humanities and looks to the future of digital Korean studies.

modern digital archives and computing power in the study of premodern Korea. In 2018, South Korea boasts large collections of heritage materials captured, archived and curated using cutting-edge database technology. These databases have been made publicly downloadable under a government-mandated open license policy. This unusual situation, to my knowledge not found in any other area studies discipline, demands that Korea specialists think creatively and reflectively about the implications of having access to such a staggering amount of high-quality humanities data. How should these repositories be structured, curated, and preserved? In what ways do our existing interpretations of Korean history and culture change due to digitization and digital methods? What can digital Korean studies teach us about the advantages and limits of data-driven humanities? What would be some effective ways of incorporating images, audio recordings, aerial photography, and 3D scans of artifacts in the Korean humanities?
The field of Korean studies, especially outside of South Korea, has given surprisingly limited attention to computing and digital methods. This lack of interest stands in stark contrast to the South Korean government's massive investments in the building of archives and the attendant dependence of Koreanists on digitized source materials. At a recent Association for Asian Studies annual meeting, the inaugural digital humanities working group meeting was held; I found myself the only specialist of Korea in a room with approximately 60 attendees. I also turned out to be the only presenter covering Korea at the DH 2018 conference in Mexico City [2].
Nonetheless, there are some encouraging signs. Korean studies librarians have been paying close attention to South Korea's digitization efforts and the digital humanities [3]. In South Korean academia, the Korean Association for Digital Humanities (Han'guk tijit'ŏl inmunhak hyŏbŭihoe 한국 디지털 인문학 협의회) was established in 2015. In a few pockets, such as the Academy of Korean Studies, Hankuk University of Foreign Studies and Ajou University, the digital humanities have been gaining traction, albeit slowly (Kim, 2016, pp. 385-388). In 2016, a solid primer articulating a long-term vision, technical challenges, global comparisons, pedagogy and reflections on failed efforts was published (Kim et al., 2016). To make sense of what "digital humanities" means in the South Korean context, however, it is important to keep in mind the legacy of a domestic phenomenon called "cultural contents studies" (Munhwa k'ont'ench'ŭhak 문화콘텐츠학). South Korea's impressive range of digital archives was funded primarily to foster the media industries, such as K-pop, television drama, cinema and video games, not necessarily to promote new types of research in the humanities.
The Korean case demonstrates that the availability of digitized materials, no matter how large in scale and how high in quality, does not spontaneously lead to explorations in new modalities enabled by digitization and digital technologies. Digital projects entail more than feeding big cultural data into a computer and expecting a groundbreaking result. Many, if not most, end up as failed experiments and lead researchers down unexpected and unforeseen paths. Months and years of training and tedious work are needed to produce meaningful outcomes and mature studies, which also necessitates cross-disciplinary cooperation with informatics, computer science, statistics and other relevant fields. In addition, digital humanities scholars need to be prepared to learn and embrace the aesthetic and storytelling aspects of digital media, such as photography, videography, graphic design, 3D modeling and animation, and game engines.
Navigating the uncharted waters of digital Korean studies appears less daunting once we realize that the study of Korea's past has been influenced by modern digital technologies in more ways than generally recognized. Because of digitization, expectations have changed and continue to change. In 2006, for example, the release of A Compendium of Korean Collected Works (Han'guk munjip ch'onggan 韓國文集叢刊) as an online database was a groundbreaking moment for many historians of Korea. Only a portion of the 1,259 collected works of premodern Korean intellectuals in the current database was made available then, and the user interface was basic by today's standards [4]. Yet, I found this resource invaluable. I gained access to an expensive collection of primary sources that the University of British Columbia at the time could not afford to have in its library stacks due to budgetary and spatial constraints. Research that used to require three hours of driving from Vancouver to the University of Washington in Seattle or three weeks of waiting for one volume among hundreds to arrive on interlibrary loan could be done anywhere, as long as I had access to a computer connected to the internet. Accordingly, I was no longer satisfied with reading this collection one volume at a time, and I read the sources more extensively than before. As I got used to this online database's various keyword search functions, I experimented with a new approach that would have been difficult to execute without digitization: broad, comparative analysis of passages associated with a relatively obscure concept, covering the works of about two dozen scholars.
Fast forward to 2018: I now have this entire database on my notebook computer's solid-state drive as a 500-megabyte Unicode text file. The 18,398 raw XML files that make up this database can be downloaded legally and free of charge, under the terms of South Korea's Open License for public data, on the National Information Society Agency's Open Data Portal. With access to the entire database, my expectations are changing yet again. I look at the 557,126 pieces of writing consisting of 154 million characters in the 2017 version of this database and wonder what new type of research may be possible. Just to start, I have tried to run topic models, map out semantic patterns and algorithmically classify the authors and writings on the basis of diction, style and figures of speech.
To make sense of where digital Korean studies originated, where it stands today and where it is headed, we should recognize the foresight of predecessors who were ahead of their time. Computing in Korean studies had a remarkably early start owing to the pioneering efforts of Song June-ho (MR: Song Chunho, 1922-2003) [5] and Edward Wagner (1924-2001). In 1959, Wagner was appointed as an assistant professor of Korean history at Harvard University, upon finishing his doctoral dissertation on the political history of fifteenth- and sixteenth-century Korea at the same institution. He spent the initial years of his appointment laying a foundation of research and teaching about Korea in the United States, starting with the publication of a textbook in written Korean language in 1963, which would grow into a three-volume project over the years (Wagner, 1963/1971). His next project was to be a multi-volume introductory history of Korea composed of contributions from experts based in North America and Europe. That project never materialized, however, and he decided to pursue something else instead. In 1964, Song June-ho of Chonbuk National University, also a historian of Korea, visited the Harvard-Yenching Institute (Plate 1). During his year-long stay, Song persuaded Wagner to analyze the elite structure of the Chosŏn 朝鮮 dynasty (1392-1910) using the roster of civil service examination degrees, called munkwa 文科. At the time, Song and Wagner did not realize that what later came to be known as the Munkwa Project would involve computing, nor that their collaboration would last for nearly 40 years. In 1966, Song began collecting portions of the examination roster in Japan [6], and a successful Ford Foundation grant application in 1967 officially initiated the Munkwa Project with the promise of compiling a database. This development in Korean studies took place during what the French historian Emmanuel Le Roy Ladurie called the "American challenge" in reference to the vigorous push in the United States to install computers on university campuses (Ladurie, 1979/1968, p. 6).
The Munkwa Project inspired the creation of other digital archives, but the resources required for such projects became abundant as the Government of South Korea's response to the 1997 Asian financial crisis included digitization. In 1998, an economic stimulus program called the Informatization Labor Project (Chŏngbohwa kŭllo saŏp 정보화근로사업) initially allocated approximately USD $200m over two years to create 48,000 white-collar jobs involving the digitization of cultural heritage (Kim, 2012, p. 601; Cha, 2015, p. 139). The National DB super-collection, an outgrowth of that initiative, now lists 31 databases categorized under "history" (yŏksa 역사), another set of 31 under "culture" (munhwa 문화), and 16 under "education" (kyoyuk haksul 교육학술) [7]. In addition to these 78 obvious candidates, several others filed under "science and technology" (kwahak kisul 과학기술) and "industry and economy" (sanŏp kyŏngje 산업경제), such as climate, public health, science and technology magazines, satellite images and bags of words of Korean and major world languages, should be of interest to humanists and social scientists as well. Nearly USD $1 billion has been collectively spent on National DBs since its inception as the Informatization Labor Project twenty years ago. Large amounts of public funds are continuing to drive the building of big databases. In 2017, the National Institute of Korean Language was granted USD $17.
1). This innovative design allows the researcher, for example, to query the information regarding 336,267 official career records mentioned in the court records with the ability to provide citations for each and every entry (Figures 2 and 3). A simplification of this XML schema was ported to the aforementioned A Compendium of Korean Collected Works and used to structure the writings of 1,259 authors (Figure 4). Moreover, digital archives in Korean studies are remarkably up-to-date in database and content-management technologies. Beyond XML, Kim Hyeon has avidly adopted, promoted and experimented with the wiki platform, the semantic web, and Neo4j's GraphDB, among others. Many recent academic projects related to Korean studies and digital humanities at the Academy of Korean Studies, such as the annual training workshop in reading and translating literary Chinese sources and a recent conference on digital storytelling, are  The data set consists of approximately 596,000 nodes and 767,000 edges, waiting to be used by researchers (Figure 5). Digital Korean studies is entering the realm of big data. To handle digitized archives of growing size and complexity, Korean studies will need to consider transitioning from an artisanal to a laboratory mindset [13]. Yet, Koreanists still overwhelmingly prefer the artisanal approach of producing specialized and detailed case studies of individuals, local communities and institutions by excavating previously unstudied or understudied materials. The belief is that the accumulation of research conducted in this manner eventually will result in a new understanding of Korea's past, one that is more objective than previous interpretations and more faithful to what the primary sources show. Collectively speaking, artisanal Korean studies is grounded in the assumption that case studies follow a normal distribution of the individuals, topics and local communities that make up the Korean peninsula and the Korean diaspora. Some influential figures, famous communities and important topics may receive more attention than others, but overall such unevenness can be corrected.
Today's age of big cultural data, however, turns this assumption on its head. The actual picture is far more skewed than what we might imagine. Along with the digitization of primary sources, secondary studies have been made available in digital form via three major service providers: KISS, DBpia and RISS. KISS has built a database with 1,387,413 articles since 1996 [14]; DBpia has accumulated 2,221,278 articles, 19,630 e-books and 31,916   I). Overall, 16 authors made up half of all case studies available on KISS, and the 80 per cent mark was reached with only 77 authors or 14.5 per cent of 531 authors. The most disconcerting finding of this exercise is the long trail of zeroes: 248 authors, or 46.7 per cent, have yet to be the subject of
Intervention is necessary. The artisanal approach has evidently resulted in a highly uneven picture of Korea's past. What should be done? What needs to be done? My initial response was to experiment with macro-level text analysis. Along with a quantitative sociologist, I tried to run topic models and identify latent patterns in the data. This turned out to be more problematic than we had anticipated. Unlike the relatively consistent corpus that made possible Jockers' (2013) "macroanalysis" of British, Irish and Irish American literature, Korean collected works consist of a mix of various prose and verse forms that appear deceptively to be uniform due to our prejudices about the timelessness of Sinitic writing [17]. In addition, we ran into segmentation issues: poetry in Chinese and Korean could not be broken down into meaningful units and prose in literary Chinese had no parser readily available. Most of the results of our topic modeling attempts returned gibberish. Segmenting the writing into 2-grams, 3-grams and 4-grams using dictionaries helped somewhat but did not address the fundamental challenges of working with the peculiar features of our data set (Figure 8).
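The dictionary-assisted segmentation into 2-grams, 3-grams and 4-grams mentioned above can be illustrated with a minimal sketch. It is not the procedure used in the study: the function names, the three-entry lexicon and the sample line are hypothetical and chosen only for illustration, and the sketch assumes that punctuation has already been stripped from the text.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return every overlapping run of n characters in text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def dictionary_ngrams(text, lexicon, sizes=(2, 3, 4)):
    """Count the n-grams of the given sizes that also appear in the lexicon."""
    counts = Counter()
    for n in sizes:
        for gram in char_ngrams(text, n):
            if gram in lexicon:
                counts[gram] += 1
    return counts


# Hypothetical lexicon and sample line, for illustration only.
lexicon = {"君子", "小人", "天下"}
sample = "君子喻於義小人喻於利"
print(dictionary_ngrams(sample, lexicon))
```

A sliding window of this kind produces every possible character sequence, so the dictionary lookup is what filters out meaningless runs. It cannot decide between overlapping candidates or recognize terms missing from the lexicon, which is consistent with the limited improvement reported above.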
As an alternative strategy, I attempted what Jockers did with function words on English-language corpora (Jockers, 2013, p. 65), but on a small subset of literary Chinese prose written in the neoclassical mode, called Tang-Song "ancient prose" style (Korean komun/Chinese guwen 古文) in East Asian literature. With the input of a Korean literature specialist, I loaded the prose writings of four early seventeenth-century literary masters: Yi Chŏnggwi 李廷龜 (1564-1635), Sin Hŭm 申欽 (1566-1628), Chang Yu 張維 (1588-1638) and Yi Sik 李植 (1584-1647). The four masters, known as Wŏl Sang Kye T'aek 月象谿澤 by the initials of their pen names, are renowned for their elegant prose in the Tang-Song neoclassical mode. However, Yi Chŏnggwi and Yi Sik have been suspected of being influenced by a contemporary literary trend that became fashionable in Beijing: archaism (Korean pokko/Chinese fugu 復古) or Old Phraseology (Korean komunsa/Chinese guwenci 古文辭) (Bryant, 2008; Chang, 2010, pp. 28-36; Rho, 2015; Ong, 2016). Simply put, archaists sought to make neoclassical prose "truly" ancient, and one of the ways to do that was to make prose more "lyrical" by suppressing the use of function words and grammatical particles. Thus, an unusually low occurrence of some of the most common function words and grammatical particles in literary Chinese prose could be interpreted as a sign that the author might have been under the influence of sixteenth-century archaism. While the method should be refined, a preliminary analysis of Wŏl Sang Kye T'aek prose pieces on MARKUS has revealed that Yi Chŏnggwi and Yi Sik indeed show a tendency to suppress the use of common function words and grammatical particles, at 9.9 and 9.7 per cent of all characters, respectively, compared to 12 to 17.7 per cent shown in the writings of other authors not suspected to have been archaists (Table II). Parenthetically, I attempted this exercise on Voyant but realized that the algorithm segmented the text into characters and words using what I suspect was a modern Chinese language parser. I ended up with erroneous results and there was no way to switch off the automatic parsing.
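The measure described above, the share of all characters taken up by common function words and grammatical particles, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the MARKUS workflow used for the analysis: the ten particles in FUNCTION_CHARS are an illustrative selection rather than the list behind Table II, and the function name is hypothetical. Only characters in the basic CJK Unified Ideographs block are counted, so punctuation and spacing do not affect the ratio.

```python
import re

# Illustrative selection only; not the set of particles used in the study.
FUNCTION_CHARS = set("之也而其以於者矣乎焉")

# Basic CJK Unified Ideographs block; punctuation and spaces fall outside it.
HAN = re.compile(r"[\u4e00-\u9fff]")


def function_char_share(text, function_chars=FUNCTION_CHARS):
    """Return the fraction of Han characters in text that are function characters."""
    han = HAN.findall(text)
    if not han:
        return 0.0
    hits = sum(1 for ch in han if ch in function_chars)
    return hits / len(han)


# Sample line for illustration only.
share = function_char_share("學而時習之，不亦說乎")
print(f"{share:.1%}")
```

Because every character is treated as its own token, this approach sidesteps the word-segmentation step that caused the erroneous results on Voyant. The trade-off is that it cannot distinguish a particle used grammatically from the same character appearing inside a name or compound.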
Frequentist analysis of humanities data can be useful. However, its limits should be acknowledged as well. Perhaps the strategies for identifying authors by their habits and influences can be scaled by using existing software tools and creating algorithms that automate the analyses and comparisons. The trouble is that the database itself is skewed. Those among the 1,259 authors whose case studies are overrepresented happen to have a large number of their writings preserved. On a timeline, the overrepresentation is concentrated in the seventeenth and eighteenth centuries. In A Compendium of Korean Collected Works, the total character length of writings before 1375, the bin range that roughly corresponds to the dynastic change from Koryŏ (918-1392) to Chosŏn, is 758,687, or only 5 per cent of the 150 million characters that constitute this database. The application of methods such as principal

Song June-ho and Edward Wagner intended to do with the Munkwa Project. I have also come to develop genuine appreciation for the value of Kim Hyeon's preferred mode of developing digital Korean studies using wikis and semantic webs. The impressive digital infrastructure available to Koreanists drives the temptation to go for top-down, omniscient observations. However, I would argue that seeing "everything" is the easy part. Song and Wagner managed to computerize 14,600 records and aggregated the data by categories such as address, choronym, data and exam performance in only two or three years, without the power and ease of today's digital technology. The Munkwa Project ended up being an unfinished 40-year-old enterprise because seeing "everything" in this sense was not the goal. Song and Wagner sought to examine the rich contours of premodern Korea's elite structure by linking the exam degree roster with the vast pool of information stored in genealogies.
Similarly, Kim Hyeon's vision is to organize the existing knowledge base of Korean studies in network representations and to create digital environments that encourage scholarly collaboration. As someone who has been involved in numerous digitization projects, Kim Hyeon had many opportunities to seek omniscience. Yet, he has shown little interest in such endeavors. Why? Digital projects inspire researchers to try strange things. In 2011, Kim Hyeon obtained a pilot license (Kim, 2012). What motivated him to fly? In his words:
To create hypermedia contents that vividly capture Korea's local cultures, I decided to grab the control stick of a light aircraft. I did this for the same reasons I became a programmer and held a camera for the first time. At first, I couldn't help but laugh. Even I thought I was taking things too far [...]. However, as I pondered this issue for three or four months, my reasons for flying became clearer. (Kim, 2012, p. 828)

Today, I use a sub-$1,000 drone to do a task that not long ago used to require a pilot license and an airplane. I fly a drone for the same reasons that brought Kim Hyeon to the sky on an airplane: exploration and discovery. The impressive digital infrastructure in Korean studies has made it possible to write an entire article without having to leave my desk. Yet, I have found myself, more than ever, actively engaging in field work, which is unusual for a specialist of the medieval and early modern periods. When I visit Andong, Chinju or Tamyang, I realize how little I know about the environment in which my historical subjects lived. I want to be in the same location where the local poets' societies gathered. I want to know the common routes that connected the various settlements in the area. I want to experience life in the region during different seasons.
Most importantly, I want to see their world, despite the gap between my time and their time, from as many different angles as possible. This desire has propelled me to develop a serious interest in photography and videography and to think about framing and capturing the world around me in field sites from multiple angles and using lenses of various formats and apertures. I also learned how to fly a drone to access macro-level perspectives of a different nature from those I get from running topic models on a textual corpus. Every time I have tried something, I have gained new insights.
Fortunately, the field of digital Korean studies seems to be headed in this direction, that is, one which prioritizes bridging the gap between the life and times of our historical subjects and us modern-day researchers equipped with digitized archives, cameras, sensors and computing power. In this emergent paradigm, our collective pursuit is not omniscience but immersion and connections. Recently, I was asked to join a team of senior academics, graduate students and database experts to create a database of citations, classical references and various instances of text reuse in the collected works of Koryŏ authors. My initial reaction to this project was skepticism: why not use text-reuse algorithms to detect such text-reuse patterns? Gradually, I was sold on the beauty of this project. The final project proposal ended up bringing together specialists with a wide range of expertise and interests but united under a common goal: to explore and discover new meanings and connections in the sparsely surviving Koryŏ-era writings, which, as aforementioned, make up only 5 per cent of the database of A Compendium of Korean Collected Works. No text-reuse algorithm is a substitute for a room of experts who can distinguish, for example, whether a classical reference was made directly or by way of other reference materials or the works of influential Chinese figures. As South Korea's full-fledged effort to digitize cultural heritage enters its twentieth year since the Informatization Labor Project during the Asian financial crisis, there have been concerns that we are "running out" of Koryŏ-era materials to digitize. The idea of building text-reuse databases shows an alternative path that could become a model for digital humanist scholars of other periods and other parts of the world. The next-generation databases for digital Koreanists will have the potential to showcase a new kind of concordance through which we can map the links and flows in the transformation of cultures over time. To do this, what digital Korean studies needs is not simply a shift from a field consisting of artisans to a field consisting of laboratories, but a field consisting of many laboratories of artisans. The field particularly needs eccentric artisans who might show up to field sites in airplanes.
5 million over five years to create 15.5 million bags of words representing the modern Korean language for AI-driven linguistic analysis [8]. At the Institute for the Translation of Korean Classics, USD $20 million is being invested annually to train a deep-learning model for translating The Diary of the Royal Secretariat (Sŭngjŏngwŏn ilgi 承政院日記). Prior to this, the digitization of the scribes' notes on the daily affairs of the early modern Korean court, covering the years from 1623 to 1910 in 242 million Sinitic characters, took 15 years, from 2001 to 2015 [9]. This painstaking task required the deciphering of documents written in cursive and shorthand forms. The next step of rendering the literary Chinese content into modern Korean is projected to take at least 45 years with the compensation for the necessary specialists limited to a meager USD $15 per page due to the project's scale [10]. Using the new deep-learning approach, the estimated time for completing the translation has been reduced to 18 years, at an annual cost of what one journalist called the equivalent of "the asking price of a single apartment unit in Gangnam [MR: Kangnam]" [11]. In addition to benefiting from an early start and large-scale funding, South Korea's machine-readable archives tend to be of remarkably high quality. For this, digital Koreanists are indebted to one of the trailblazers: Kim Hyeon (MR: Kim Hyŏn), who currently heads the department of humanities informatics at the Academy of Korean Studies. Originally a specialist of Korea's Neo-Confucian philosophy, Kim Hyeon's foray into humanities computing began in 1985 with a position in the Korea Institute of Science and Technology. His initial interest in informatics involved the encoding of han'gul 한글 (modern Korean phonetic characters) and hancha 漢字 (a set of Sinitic characters used in written Korean language). Throughout his distinguished career in humanities computing and digital humanities, he had the extraordinary ability to remain at once obstinate and free from dogmatism about technology. In the early 1990s, when CD-ROM emerged as the high-capacity storage medium of the future, he helped found a start-up company to produce the first-ever digital edition of the Annals of the Chosŏn Dynasty (Chosŏn wangjo sillok 朝鮮王朝實錄), offering the ability to search through its 50 million characters of full text. During the years of South Korea's rapid expansion of digital infrastructure, his company transferred the ownership rights of this database to the National Institute of Korean History. Throughout this process and in a different capacity, Kim played a key role in reworking the archive's data ontology for the internet. The current online edition of the Annals of the Chosŏn Dynasty consists of 674 XML documents, with detailed annotations of every full-text entry (Figure
Plate 1. Edward Wagner and Song June-ho at a Buddhist temple near Chŏnju in 1970, along with their
Figure 1. A heavily XML-tagged Sillok entry
Figure 2. The career history of Sŏ Kŏjŏng (1420-1488) generated via real-time query request on the XML files that make up The Annals of the Chosŏn Dynasty

Figure 3. On the third day of the first lunar month of 1473, Sŏ Kŏjŏng was the Chief Censor
Figure 4. A portion of the digital edition of Sŏngsobu pugo, or the collected writings of Hŏ Kyun (1569-1618)
Figure 5. The Chosŏn dynasty's royal family genealogy in GraphDB, which consists of approximately 596,000 nodes and 767,000 edges
Figure 7. A Pareto chart of case studies that appear in the KISS database for Korean intellectuals born between 1450 and 1750
Notes
1. I have introduced some minor changes to Peter Putnam's translation of this passage.
2. The DH 2018 conference program is available at: https://dh2018.adho.org/en/talleres/

Table I .