Technological competitiveness of China's internet platformers: comparison of Google and Baidu by using patent text information

Purpose – This study aims to assess the technological capability of Chinese internet platforms (BAT: Baidu, Alibaba, Tencent) compared to US ones (GAFA: Google, Amazon, Facebook, Apple). More specifically, this study explores Baidu's technological catching-up process with Google by analyzing their patent textual information.
Design/methodology/approach – The authors retrieved 26,383 Google patents and 6,695 Baidu patents from PATSTAT 2019 Spring version. The collected patent documents were first vectorized using the Word2Vec model, and then K-means clustering was applied to visualize the technological space of the two firms. Finally, novel indicators were proposed to capture the technological catching-up process between Baidu and Google.
Findings – The results show that Baidu follows a trend of US rather than Chinese technology, which suggests Baidu is aggressively seeking to catch up with US players in the process of its technological development.
© Kazuyuki Motohashi and Chen Zhu. Published in Asia Pacific Journal of Innovation and Entrepreneurship. Published by Emerald Publishing Limited. This article is published under the Creative Commons Attribution (CC BY 4.0) licence. Anyone may reproduce, distribute, translate and create derivative works of this article (for both commercial and non-commercial purposes), subject to full attribution to the original publication and authors. The full terms of this licence may be seen at http://creativecommons.org/licences/by/4.0/legalcode


Introduction
The advancement of artificial intelligence (AI) (machine learning) could turn massive data from the internet and IoT sensors into a gold mine (Agrawal et al., 2018). AI technology is versatile and applicable across various industries (Trajtenberg, 2018; Motohashi, 2020). Not only does it improve the accuracy of predictions, but it also enhances the economy of scope in big data analysis. The general-purpose nature of AI technology, or the non-rivalry of big data across various applications, allows internet business firms to grow as internet platforms, expanding their services to a variety of industries (Goldfarb and Trefler, 2018). Accordingly, Google, Amazon, Facebook and Apple (GAFA) have become top-listed firms in stock market valuation rankings.
At the same time, the concentration of data in a small number of firms, such as GAFA, has raised concern among national authorities outside the US. Google has been fined a combined $9.5bn since 2017 by EU antitrust regulators, and EU regulatory bodies have kept a close watch on the activities of other US internet firms. The EU also imposes the General Data Protection Regulation to ensure European standards of privacy protection when private data are transferred beyond EU borders. Such policy actions could lead to "virtual nationalism," where cyberspace is compartmentalized by nation/region (Economist, 2020).
In this regard, China is going its own way by virtually banning internet business on US internet platforms and international data transfer (Chorzempa et al., 2018). As a result, indigenous internet giants Baidu, Alibaba and Tencent (BAT) have emerged in a domestically segmented cyberspace insulated from international competition. Based on huge amounts of data from 800 million smartphone users, as well as large domestic markets in China, Alibaba and Tencent are listed in the global top 20 in terms of market capitalization. Recently, BAT have invested heavily in AI technology based on a large talent pool inside China. The Chinese Government plans to become a global AI leader by 2025, and BAT are expected to play a crucial role (Biancotti and Ciocca, 2018).
This study focuses on Baidu and Google and assesses the technological capability of Chinese internet platforms compared to US ones. These two firms are quite comparable in terms of their business domain and advertising based on internet search queries, and both firms have recently made substantial investments in autonomous driving technology. We use text information (abstracts) of patent applications submitted to the US Patent and Trademark Office (USPTO) and CNIPR (China patent authority). The text information of patent data is assumed to reflect the content of the invention precisely. The similarity score of two patents based on the patent abstract provides more accurate information than their IPC code (Arts et al., 2017). In addition, the vector space model with a high dimension of continuous variables gives finer-grained information about patent contents, as compared to one-dimensional IPC codes with discrete variables (Younge and Kuhn, 2016; Motohashi et al., 2019).
Understanding the technological capability of Chinese firms is important from the perspective of both business and policy. A firm in a developed economy, such as Japan, cannot conduct internet/IoT business in China by itself but needs to collaborate with local firms such as BAT. Under such conditions, it is critical to assess the technological capability of Chinese counterparts, as the bargaining position in partnership negotiation depends on relative management resources, particularly technological capacity, to which Chinese firms are eager to gain access. In addition, as tensions between the US and China due to trade disputes become intense, information on technological competitiveness in both countries is essential intelligence for policymakers in third countries. This is particularly the case for Japan, as both countries are very important partners, and an inappropriate strategy to deal with them may cause substantial damage to the domestic economy.
The remainder of this paper is organized as follows. Section 2 reviews catch-up related literature and our research framework. Section 3 outlines the data source and methodology of our vector space model based on internet technology patents from USPTO and CNIPR. Google and Baidu patents are compared via two types of empirical analysis in Sections 4 and 5. One is an overview of the technologies of these two firms using clustering analysis. The other is based on a more micro view of individual patents, together with the distribution of patented technologies of its neighbors in the technology space. Finally, we conclude with a summary of the findings and policy implications in Section 6.

Technological catch-up and proposed research framework
The concept of catch-up possesses a significant and enduring historical legacy, marked by notable examination in Abramovitz's (1986) influential work. It achieved prominence in the post-Second World War period, characterized by the USA's early adoption of advanced methods of production and industrial practices that other countries had not yet embraced. According to this scenario, catch-up is commonly defined by economic scholars as the process of reducing the disparities in productivity and income between a leading nation and a trailing one (Fagerberg and Godinho, 2005). Kashani et al. (2022) examine the evolution of catch-up studies and suggest that catch-up can be measured by a range of indicators, including productivity, income and technological capability. The primary focus of this study lies in the technological aspect of catch-up, defined as the significant improvements in technological capabilities by firms from technologically disadvantaged nations as they close the gap with advanced incumbents, moving closer to the global technological frontier (Miao et al., 2018).
Theoretically, Bell and Pavitt (1993) introduce a framework to conceptualize technology as a capability in the catch-up process. This framework emphasizes that technological capabilities, representing a firm's capacity to absorb and learn from imported technology, are critical determinants of successful technology transfers in developing countries. It has underpinned numerous empirical studies examining the growth of latecomer firms and the impediments to their leadership emergence. Studies on technology catch-up fall into the following two main categories based on research methods: qualitative case studies and quantitative empirical research. Qualitative studies have explored the success of catch-up among Asian firms in diverse industries, including consumer electronics, automotive and shipbuilding (Cho et al., 1998; Kim, 1998; Fan, 2006; Mathews, 2006). In quantitative research, patent data, often regarded as a common proxy for technological knowledge, have gained prominence in monitoring the technological catch-up process.
Considering catch-up as a learning process, prior research has used patent citations to track technology acquisition. Wang et al. (2014) leverage the citations of licensees' patents to discern whether latecomer firms had gleaned knowledge from prior licensing agreements. In addition, Lee (2013) conducts a comprehensive comparison of technological capabilities between Korean firms and their US counterparts, using a range of citation-based indicators such as quality, originality and diversity. Although citation information has been widely used to measure patent quality and technology spillover, such information is not available in many developing countries. In this light, we introduce a novel framework that leverages patent text data to monitor the catch-up process between latecomer firms and incumbents in advanced economies. Initially, we train our own Word2Vec model based on a large-scale patent corpus. This trained Word2Vec model is then used to convert patent texts, specifically abstracts, into vector format. Subsequently, clustering analysis is performed to provide an overview of the technological landscape and detailed technical domains. Following that, two semantic-based indicators are introduced to compare the technological capabilities of Google and Baidu. Constructing pairwise cosine similarity scores for a large-scale data set, such as one exceeding one million entries, is computationally demanding. Therefore, we use a neighborhood graph and tree (NGT) to search for similar patent pairs. Figure 1 presents the proposed research framework.

Vector space model of internet technology

Data source
To conduct a fair comparison of a US firm (Google) and a Chinese firm (Baidu), we use the patent data from USPTO and CNIPR. Specifically, we retrieve all patent application information by Google (26,383 USPTO patents) and Baidu (6,695 CNIPR patents) from the PATSTAT 2019 Spring version. We then check the IPC subgroups of these patents to identify internet-related technology patents. We identify a total of 2,350 IPC subgroups, but many of them contain a very small number of Google or Baidu patents.
We treat the subgroups with at least 100 Google or Baidu patents as a core technology of internet search engine-related business and retrieve all patents belonging to these 50 subgroups for subsequent analysis. There are 680,241 US patents and 427,628 CN patents from 1959 to 2018. The subgroups span seven IPC classes ("F24," "G01," "G02," "G06," "G09," "G10," "H04"), but more than 95% of patents belong to the G06 (computing, calculating, counting) and H04 (electric communication technique) classes. Figure 2 shows the number of patents by application year. It should be noted that most patent applications via CNIPR have been made within the last five years, while USPTO patent applications were made relatively earlier. A drop in patent applications in recent years comes from data truncation associated with the time lag between application and publication years, particularly for USPTO patents.

Vector space model
The sheer number of patents makes it difficult to extract useful information and the relationships among them. Recent text mining techniques turn a document into a vector form so that existing machine learning algorithms can be used. We followed the classic Skip-gram model proposed by Mikolov et al. (2013) to build word vector representations for our patent corpus. We then calculated the document embedding for a patent by averaging the vectors of all nouns occurring in that patent. To do so, we first preprocessed the patent corpus. Wang et al. (2019) noted that the word representations should be able to demonstrate multifacetedness. That is, the trained Word2Vec model should yield meaningful representations for words in different forms (e.g. in different tenses). Furthermore, many pre-trained word embedding models (e.g. Google pre-trained Word2Vec models) keep words in their original forms.
In line with this convention, we did not conduct lemmatization; we only removed punctuation, converted all words to lowercase and turned all digits into the token "<num>". The corpus was built on 1,107,869 patent applications. We retained words with frequencies higher than four. A Skip-gram model was then adopted to build a 300-dimensional vector for each word in the corpus. Our Skip-gram model generated vector representations for 170,340 words, of which 73,780 (43%) were nouns.
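The preprocessing steps described above can be sketched as follows. This is a minimal illustration rather than the pipeline actually used: the function names `tokenize` and `build_vocabulary`, the regular expression and the treatment of decimal numbers as a single digit token are all assumptions made for the example.

```python
import re
from collections import Counter

NUM_TOKEN = "<num>"
# Assumption: a token is either a number (optionally with decimals) or a run of letters.
TOKEN_RE = re.compile(r"\d+(?:\.\d+)?|[a-z]+")


def tokenize(text):
    """Lowercase the text, drop punctuation and map every number to <num>.

    No lemmatization is applied, so inflected forms stay distinct.
    """
    tokens = TOKEN_RE.findall(text.lower())
    return [NUM_TOKEN if tok[0].isdigit() else tok for tok in tokens]


def build_vocabulary(token_lists, min_count=5):
    """Keep words whose corpus frequency is higher than four (i.e. at least 5)."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    return {tok for tok, n in counts.items() if n >= min_count}


if __name__ == "__main__":
    print(tokenize("A 3.5 GHz Processor, with 2 cores."))
```

The token lists produced this way are the kind of input a Skip-gram trainer would consume; the training step itself is not reproduced here.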
From the results of this word embedding (300-dimension vector expressions for each word), the document vector $d_j$ (corresponding to the patent content expression) is computed by the following:

$$ d_j = \frac{1}{n_j} \sum_{w_i \in d_j,\; w_i \in N} v_i $$

where $v_i$ is the vector representation of word $w_i$; $n_j$ is the number of nouns occurring in the document $d_j$; and $N$ is the set of all nouns in the dictionary.
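As a minimal sketch of this averaging step, the function below computes a document vector from a token list. The name `document_vector` and its arguments are hypothetical, and the text does not say whether a noun that appears several times is counted once or per occurrence; this sketch assumes each occurrence is counted.

```python
def document_vector(tokens, word_vectors, nouns):
    """Average the word vectors of the noun tokens in one patent abstract.

    tokens: list of preprocessed words in the abstract.
    word_vectors: dict mapping a word to its embedding (list of floats).
    nouns: set of words tagged as nouns (the set N in the text).
    Returns None when the abstract contains no noun with a known vector.
    """
    # Assumption: a noun occurring twice contributes twice to the average.
    selected = [word_vectors[t] for t in tokens if t in nouns and t in word_vectors]
    if not selected:
        return None
    dim = len(selected[0])
    return [sum(vec[k] for vec in selected) / len(selected) for k in range(dim)]


if __name__ == "__main__":
    vectors = {"image": [1.0, 0.0], "display": [0.0, 1.0], "shows": [5.0, 5.0]}
    print(document_vector(["image", "shows", "display"], vectors, {"image", "display"}))
```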

Validation of document embedding results
The document embedding results are created in two steps: (1) word embedding and (2) aggregation at the document level. In terms of the first step, we conduct a face validation of word embedding results. Specifically, we conduct k-means clustering of embedded words to check that similar words are grouped into the same cluster. The results of the clustering analysis are presented in Appendix 1. For example, the first cluster consists of "image-related" words, including "image," "position," "display," and "picture." The second one shows the list of text-related words ("document," "language," etc.). Accordingly, it is possible to conclude that our word embedding results are reasonable.
In the second step (aggregation at document level), we take a simple average of word embedding vectors in each document. To assess the document embedding results, we use Doc-DB patent family information. Within each patent family, all patents are based on the same invention, so the contents of these patents should be close to each other. We calculate pairwise cosine similarities of the patents corresponding to the same patent family. It should be noted that one patent family could have both USPTO and CNIPR patents. Therefore, we could evaluate document embedding results separately using US-US, CN-CN and US-CN pairs.
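The family-based check can be illustrated with a short sketch that computes all pairwise cosine similarities inside one patent family and groups them by pair type. The data layout (a list of `(authority, vector)` tuples) and the function names are assumptions for illustration, and vectors are assumed to be non-zero.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def family_pair_similarities(family):
    """family: list of (authority, vector) with authority 'US' or 'CN'.

    Returns a dict with the cosine similarities of all US-US, CN-CN and
    US-CN pairs found inside this single patent family.
    """
    result = {"US-US": [], "CN-CN": [], "US-CN": []}
    for (auth_a, vec_a), (auth_b, vec_b) in combinations(family, 2):
        pair_type = "US-CN" if auth_a != auth_b else f"{auth_a}-{auth_b}"
        result[pair_type].append(cosine(vec_a, vec_b))
    return result


if __name__ == "__main__":
    fam = [("US", [1.0, 0.0]), ("US", [1.0, 0.0]), ("CN", [0.0, 1.0])]
    print(family_pair_similarities(fam))
```

Pooling these lists over all families gives the distributions that are compared by pair type.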
Figures 3 and 4 show the distribution of cosine similarity of document embedding results between patent family pairs. For a given patent family, we calculated all pairwise cosine values among its member patents and then described the results separately using US-US, CN-CN and US-CN pairs. The mode for each type of pair is 1 (exactly the same vector), and most pairs have cosine similarity close to 1. We could conclude that our document embedding method produces reasonable results. In addition, the US-US patent family pairs are relatively closer in terms of content, as compared to the CN-CN pairs, and the US-CN pairs are in the middle. Therefore, there may not be any systematic bias associated with the data source (USPTO or CNIPR patents), which is important to make a fair comparison between Google and Baidu in the following sections.

Table 1 shows the results of descriptive statistics of cosine similarities of patent pairs by type of family and by type of document-level aggregation. We have again confirmed that the median point of each type of pair is close to 1 (at least 0.97), suggesting the validity of document embedding results. Table 1 also reports the results using TF-IDF weighted averages of word embedding results (figures with asterisks). The cosine similarity of these figures is even lower than that of the simple mean. Therefore, we proceed with the subsequent analysis by using the document embedding results with a simple average of word embedding vectors.

Clustering analysis
The contents of the patent corpus are explored by dividing the whole corpus into several clusters. We used k-means to conduct clustering based on the vectorized patent content information. In terms of the granularity of clustering, we take the number of IPC subclasses, that is, 11. We could set this number arbitrarily, but it becomes difficult to gain a broad picture from too many clusters. Nor can the number of clusters be too small, as the whole corpus would then be divided too coarsely. We applied k-means clustering to 1,107,869 patents, and the word cloud of each cluster is presented in Figure 5. The size of each word in this figure corresponds to the aggregated TF-IDF value of the word in each cluster (the sum of patent-level TF-IDF values at the cluster level) and can be formally expressed as follows:

$$ T_i(C) = \sum_{D_j \in C} t_{ji} $$

where $T_i(C)$ is the aggregated weight of word $w_i$ in cluster $C$, the $D_j$'s are patents in cluster $C$, and $t_{ji}$ is the TF-IDF value of word $w_i$ in patent $D_j$. Figure 5 also shows the label of each cluster, created by using this word cloud information, together with 10 patents located near the center point of each cluster (a list of titles of these patents is presented in Appendix 2).
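The cluster-level aggregation of TF-IDF values can be sketched as below, assuming that patent-level TF-IDF scores and cluster assignments have already been computed. The function names and the dictionary-based data layout are hypothetical choices for the example.

```python
from collections import defaultdict


def cluster_word_weights(tfidf_by_patent, cluster_of):
    """Sum patent-level TF-IDF values of each word over the patents of each cluster.

    tfidf_by_patent: {patent_id: {word: tfidf value}}
    cluster_of: {patent_id: cluster label from k-means}
    Returns {cluster label: {word: aggregated weight}}.
    """
    weights = defaultdict(lambda: defaultdict(float))
    for patent_id, word_scores in tfidf_by_patent.items():
        cluster = cluster_of[patent_id]
        for word, score in word_scores.items():
            weights[cluster][word] += score
    return {c: dict(w) for c, w in weights.items()}


def top_words(weights, cluster, k=3):
    """Words with the largest aggregated weight, i.e. the largest in the word cloud."""
    ranked = sorted(weights[cluster].items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:k]]


if __name__ == "__main__":
    tfidf = {"p1": {"image": 0.5, "display": 0.25}, "p2": {"image": 0.25, "query": 0.5}}
    w = cluster_word_weights(tfidf, {"p1": 0, "p2": 0})
    print(top_words(w, 0, k=2))
```

Ranking words by aggregated weight is also a convenient way to draft a label for each cluster before inspecting the patents nearest to its center.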
Figure 6 visualizes the contents of 1.1 million patents, together with the location of each of the 11 clusters. For this purpose, the 300-dimensional document vectors are reduced into 2D space. We use the Uniform Manifold Approximation and Projection (UMAP), which has a superior run-time efficiency (McInnes et al., 2018). UMAP can convert high-dimensional data into a low-dimensional space while preserving both local and global structures. There are three broad types of patent content: (1) web application, such as data analytics, language modeling and web content application; (2) display interface, such as image recognition and human interface; and (3) ICT infrastructure, such as storage system, file management and mobile communication.
Figure 7 shows the share of patent applications by cluster and country (USPTO or CNIPR). The share of ICT infrastructure patents (such as storage, file management systems and wireless communication) is found to be larger for the USA, while there are relatively more application-related patents (such as mobile user interaction and data analytics) for China. Such differences come from the difference in the timing of technological development in both countries. US patent applications started in the 1990s and grew rapidly in the early 2000s, while for China, most patent applications were submitted after 2010. Players in China, including Baidu, therefore focus more on application developments based on ICT infrastructure technologies developed by US players.
Figure 8 shows the location of Google and Baidu patents in the technology space based on the information compiled using UMAP in Figure 6. Google patents are more widely distributed in the space, while Baidu patents are concentrated in some particular fields, such as data analytics, mobile user interaction and Web search/language modeling. Google's first patent application was submitted in 1997, while Baidu started applying for patents mainly after 2009. As is shown in cross-country trends in the USA and China, Baidu focuses on application development in the process of technologically catching up with Google.
To control for cross-country differences in patent contents, we calculate the revealed comparative advantage (RCA) index for Google and Baidu by cluster as follows:

$$ \mathrm{RCA}_{ij} = \frac{P_{ij} \,/\, \sum_{j'} P_{ij'}}{\sum_{i'} P_{i'j} \,/\, \sum_{i'} \sum_{j'} P_{i'j'}} $$

where $P_{ij}$ is the patent count of firm "i" in cluster "j", and the sums over $i'$ run over all applicants at the firm's home patent authority (USPTO for Google, CNIPR for Baidu). Figure 9 shows RCA for Google and Baidu (i = Google or Baidu) by cluster (j). It should be noted that the value of RCA is greater than 1 when a firm focuses on a particular field, and vice versa. First, the pattern of RCA by cluster is very similar across these two firms. As both are operating internet search engines, a high value can be found for web search and language modeling (Google: 2.48, Baidu: 3.36). In addition, the RCA of file management system is greater than 1 for both firms. Second, differences can be found between these firms in web content application (Google > Baidu) and mobile user interaction (Google < Baidu). This point can be explained by the difference in the ICT environment between the two countries, that is, mobile internet is diffused more widely in China. As a consequence, it is more important for Baidu to invest more in mobile-specific applications, such as internet services taking user location information into account.
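A minimal sketch of an RCA calculation is given below. It assumes the standard form of the index, in which a firm's share of patents in a cluster is divided by the corresponding share among all patents at the firm's home patent authority; the function name `rca` and the toy counts are illustrative only and do not reproduce the figures reported here.

```python
def rca(firm_counts, reference_counts):
    """Revealed comparative advantage of one firm by cluster.

    firm_counts: {cluster: number of the firm's patents in that cluster}
    reference_counts: {cluster: number of all patents at the firm's home
        patent authority in that cluster}; every count is assumed positive.
    Returns {cluster: RCA}, where a value above 1 signals relative focus.
    """
    firm_total = sum(firm_counts.values())
    ref_total = sum(reference_counts.values())
    result = {}
    for cluster, ref_n in reference_counts.items():
        firm_share = firm_counts.get(cluster, 0) / firm_total
        ref_share = ref_n / ref_total
        result[cluster] = firm_share / ref_share
    return result


if __name__ == "__main__":
    firm = {"search": 30, "storage": 10}                      # toy numbers
    country = {"search": 200, "storage": 600, "display": 200}  # toy numbers
    for cluster, value in rca(firm, country).items():
        print(cluster, round(value, 2))
```

In this toy case the firm holds 75% of its patents in "search" against 20% for the country as a whole, so its RCA in that cluster is 3.75.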

Technology space distribution analysis
The foregoing clustering analysis provides an overview of the technology space in terms of patenting, but it does not provide detailed information on the within-cluster distribution of individual patents. In this section, we generate statistics regarding the neighborhood patents of each of over one million patents in our sample in terms of content. Specifically, we estimate the top 200 nearest patents in terms of cosine similarity to each patent.
An apparent difficulty is that deriving all pairwise cosine similarities among over one million patents involves a massive amount of computation. We therefore used an NGT proposed by Sugawara et al. (2016) for indexing, which is an approximate similarity search method. NGT has been developed for efficient retrieval of relevant internet content by search engines, but it can be applied to any type of text information. Motohashi et al. (2019) use NGT results for patent titles and abstracts published by the Japan Patent Office to understand the characteristics of academic patents (as compared to firm patents).
NGT uses a tree structure for indexing network graphs efficiently. A key parameter is epsilon, which sets the search range for nearest neighbors. There is a trade-off between the search range and search time. We fit our samples and use epsilon = 0.35 with an accuracy rate of 0.997 (see Appendix 3 for details).
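The trade-off between search range and accuracy can be checked on a sample by comparing approximate results with an exact brute-force search. The sketch below shows only this evaluation step and does not call the NGT library; it assumes that "accuracy" means the share of exact nearest neighbors recovered by the approximate search, which may differ from the definition used in Appendix 3. All names are hypothetical.

```python
import heapq
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def exact_top_k(query_id, vectors, k):
    """Brute-force k nearest patents to one patent by cosine similarity."""
    query = vectors[query_id]
    scored = ((cosine(query, vec), pid) for pid, vec in vectors.items() if pid != query_id)
    return [pid for _, pid in heapq.nlargest(k, scored)]


def recall_at_k(approx_ids, exact_ids):
    """Share of the exact neighbours that the approximate search also returned."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


if __name__ == "__main__":
    vecs = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [0.5, 0.5]}
    exact = exact_top_k("a", vecs, k=2)
    print(exact, recall_at_k(["b", "c"], exact))
```

Because the brute-force search is quadratic in the number of patents, it is only practical on a sample, which is why an approximate index is needed for the full corpus.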
Figure 10 presents the average cosine similarity of the 200th nearest patents (i.e. the patents ranked 200th in terms of the cosine similarity) with each of 1.1 million patents by application year and patent authority. An upward time trend (technology space becomes denser over time) can be found in CNIPR patents, while it is not the case for USPTO patents. As a result, the cosine similarity of the 200th nearest patents for CNIPR patents (around 0.90) becomes greater than that of USPTO patents (around 0.88) on average.
Figure 11 shows the share of USPTO patents in the top 200 nearest patents by patent authority (CNIPR or USPTO). The share for USPTO patents is stable at around 70%, meaning 30% of the top 200 nearest patents are CNIPR patents. In contrast, the share for CNIPR patents rose until 2006, then fell. The upward trend corresponds to the period in which the number of USPTO patents increases, while a downward trend occurs when the number of CNIPR patent applications overtakes USPTO patents. More importantly, a pattern of technology divergence is revealed between the two countries, that is, increasing numbers of same-country patent pairs in terms of content similarity rather than cross-country pairs.
The information on the 200 nearest patents in terms of patent contents provides a picture of the technology space around the patent to be examined. As shown in Figure 12, finding near patents corresponds to drawing a border within which the 200 nearest patents are located. The border is a hypersphere (300 dimensions) with a radius equal to the distance (e.g. 1 - cosine similarity) between the patent to be examined and the 200th nearest patent. The technology space is densely populated with surrounding patents if the radius (1 - cosine similarity) is small, and vice versa. It should be noted that there are two types of surrounding patents. One is a patent applied for before the patent to be examined, and the other is one applied for thereafter. A patent application provides information on preceding patents, and we refer to such patents as BASE. We refer to the latter as FOLLOW, as these patent applications were submitted following the patent to be examined.
BASE could be considered as a backward citation and FOLLOW as a forward citation. Hence, the number of BASE patents can be used as an indicator of the novelty of a patent (smaller BASE means more novelty), and the number of FOLLOW patents indicates the impact of a patent (larger FOLLOW means more impact).
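Counting BASE and FOLLOW patents for one focal patent can be sketched as follows. The function name is hypothetical, and the handling of neighbors filed on the same date as the focal patent is an assumption: here they are counted as FOLLOW, so that FOLLOW equals the number of neighbors minus BASE.

```python
from datetime import date


def base_follow_counts(app_date, neighbor_dates):
    """Split the nearest neighbours of a patent into BASE and FOLLOW counts.

    app_date: application date of the patent to be examined.
    neighbor_dates: application dates of its nearest patents (200 in the text).
    Assumption: neighbours filed on the same date are counted as FOLLOW.
    """
    base = sum(1 for d in neighbor_dates if d < app_date)
    follow = len(neighbor_dates) - base
    return base, follow


if __name__ == "__main__":
    neighbours = [date(2010, 1, 1), date(2011, 6, 1), date(2013, 3, 1)]
    print(base_follow_counts(date(2012, 5, 1), neighbours))
```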
We use this information to assess the technological capability of Google and Baidu. As is the case for citation information, this indicator can be biased by data truncation, that is, the newer the patent to be examined, the more BASE patents and the fewer FOLLOW patents could be found. Therefore, we normalized the number of BASE and FOLLOW (200 - BASE) using the number of patent applications before and after, respectively. In addition, there is a time trend of such indicators, particularly for CNIPR patents. As the number of patent applications increases (Figure 2) in densely populated fields (Figure 10) for CNIPR patents, FOLLOW tends to be larger, while BASE is smaller. Therefore, we need to control for the patent authority difference (USPTO or CNIPR). Finally, we derive the following indicators for cumulativeness (less novel) and impact for each patent:

$$ \mathrm{Cumulativeness}_i = \frac{\mathrm{BASE}_i \,/\, \sum_{t<T} P_t}{\overline{\left(\mathrm{BASE} \,/\, \sum_{t<T} P_t\right)}_{c,T}}, \qquad \mathrm{Impact}_i = \frac{\mathrm{FOLLOW}_i \,/\, \sum_{t>T} P_t}{\overline{\left(\mathrm{FOLLOW} \,/\, \sum_{t>T} P_t\right)}_{c,T}} $$

where $\mathrm{BASE}_i$ and $\mathrm{FOLLOW}_i$ are the numbers of BASE and FOLLOW patents of patent "i" with application date "T" and patent authority "c" (US or China), $P_t$ is the count of patent applications at application date "t", and the bar denotes the average over all patents with the same patent authority "c" and application date "T". Here we conduct double normalization: by timing (BASE is normalized by the number of patent applications before the patent to be examined, that is, all candidates for BASE, and likewise for FOLLOW) and by the country of patent authority.
Figure 14 shows the impact indicators of Google and Baidu. Google's performance is stable over time around 1, reflecting an average impact under US standards. However, the impact of USPTO patents is found to be more than average (around 1.2), while the impact of CNIPR patents is less than average (0.7 to 0.8). In contrast, Baidu shows quite dynamic patterns for this indicator. While the overall impact indicator has recently fallen, USPTO neighbor patents reveal an increase regarding this indicator. Together with the finding in Figure 13, Baidu is found to pay more attention to technological development in China and started patenting in mainstream technologies in the USA so that both cumulativeness and impact measured by US patents increase over time. It should be noted that the USPTO-based impact indicator has recently become greater than 1, suggesting Baidu has achieved technological catching up with US players to some extent.
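The double normalization described in this section can be sketched for the cumulativeness indicator as below. This is one possible reading of the procedure, not a confirmed reproduction of it: the sketch assumes yearly rather than daily application counts, pools both authorities when counting earlier applications, and takes the authority normalization to be a division by the mean ratio of patents with the same authority and year. All names and numbers are hypothetical.

```python
from collections import defaultdict


def cumulativeness(patents, applications_per_year):
    """Doubly normalized BASE indicator for each patent.

    patents: list of dicts with keys 'id', 'authority', 'year', 'base'.
    applications_per_year: {year: number of patent applications}.
    Assumes every patent has at least one earlier application in the data.
    """
    # Step 1 (timing): divide BASE by the number of earlier applications,
    # i.e. by the pool of all candidates for BASE.
    ratio = {}
    for p in patents:
        earlier = sum(n for y, n in applications_per_year.items() if y < p["year"])
        ratio[p["id"]] = p["base"] / earlier

    # Step 2 (authority): divide by the mean ratio of patents that share
    # the same patent authority and application year.
    groups = defaultdict(list)
    for p in patents:
        groups[(p["authority"], p["year"])].append(ratio[p["id"]])

    result = {}
    for p in patents:
        peers = groups[(p["authority"], p["year"])]
        result[p["id"]] = ratio[p["id"]] / (sum(peers) / len(peers))
    return result


if __name__ == "__main__":
    per_year = {2010: 100, 2011: 100, 2012: 200}  # toy counts
    sample = [
        {"id": "g1", "authority": "US", "year": 2012, "base": 150},
        {"id": "u2", "authority": "US", "year": 2012, "base": 50},
        {"id": "c1", "authority": "CN", "year": 2012, "base": 100},
    ]
    print(cumulativeness(sample, per_year))
```

Under this reading, a value of 1 means that a patent has as many earlier neighbors as the average patent filed at the same authority in the same year; the impact indicator would follow the same two steps with FOLLOW and later applications.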

Conclusion
Technology upgrading of China's internet platforms has received growing attention given their huge data assets of a billion mobile users together with ample engineering talent for AI and data science. China has set a goal of becoming a global leader in AI by 2025, and it is assumed that BAT (China's GAFA equivalent) will play a vital role. Using Google as the benchmark, this study assessed the technological capability of Baidu. We use patent text information (abstract of invention) to examine how these two firms have developed over time.
We extract internet-related technology patents from USPTO and CNIPR patent publication information to determine the technology trajectory of both countries' patent applicants. Internet-related patent applications to CNIPR have increased significantly in the past five years, and the contents of patent applications in both countries are found to be diverging. This may be due to the fact that China's internet market is segmented from the rest of the world and evolving in its own way. The rapid progress of mobile internet in China also explains the difference in technology portfolios across the two countries.
Given such general trends of technological development, Baidu and Google show similar patterns of focused areas of R&D in general, such as web search technology and data analytics for language modeling, reflecting their common business model built on internet search engines. However, our results reveal some differences, such as more mobile applications in Baidu and more web content applications in Google. In terms of the dynamics of technological development, Baidu follows a trend of US rather than Chinese technology, and it is assumed that Baidu is aggressively seeking to catch up in the process of technological development. At the same time, the impact index of Baidu patents increases over time, suggesting its upgrading of technological competitiveness.
This study proposes a new methodology to analyze technology mapping and evolution based on patent text information. Citation information has been used extensively for patent characteristics (mainly patent quality) and technology spillover (Nagaoka et al., 2010). However, patent citation information is unavailable in many countries, including China. In contrast, the proposed methodology offers wider geographic applicability, particularly when using patent information in developing countries, due to the availability of patent abstract information in most nations. Furthermore, recent studies have highlighted the utilization of companies' Web pages to monitor their market-side opportunities (Park and Geum, 2022; Motohashi and Zhu, 2023). As web data are also in a textual format, our proposed methodology can be easily applied to these datasets for a better understanding of market-side catch-up and competition.
However, there are also some limitations in our methodology. First, we use fixed word embedding information over time. The content of the same term, such as "machine learning," for example, should change over time as its technology progresses. Therefore, our document embedding results could represent a range of various technologies, while they are weak at measuring the progress (or depth) of a particular technology component. Using a word embedding methodology that takes the context of each word within paragraphs into account, such as BERT, may be a potential solution. In addition, the size of neighbor patents (200 in our case) is arbitrary. We could decrease or increase this size, but the number depends on the scope of our analysis or the extent to which we want to identify the density of technology (patent) distribution. We may use the kernel smoothing technique in multi-dimension space for future research.

Figure 1. Research framework
Figure 2. Internet-related patents by application year
Figure 3. Distribution of cosine similarity between pairs within patent families
Figure 4. Histograms of cosine similarity between pairs within patent families
Figure 5. Word cloud of clustering results
Figure 6. UMAP visualization of patent contents and clustering results
Figure 7. Composition of patent contents by country
Figure 8. Comparison of Google and Baidu patents
Figure 9. RCA of Google/Baidu patents in each country
Figure 11. Share of USPTO patents in 200 neighbors by country
Figure 13. Cumulativeness of Google and Baidu patents

Table 1.
Note: (*) denotes the results of TF-IDF weighted document embedding
Source: Created by authors