Who are the 100 largest scientific publishers by journal count? A webscraping approach

Purpose – How can one obtain a list of the 100 largest scientific publishers sorted by journal count? Existing databases are unhelpful, as each of them carries biased omissions and data quality flaws. This paper tries to fill this gap with an alternative approach.
Design/methodology/approach – The content coverages of Scopus, Publons, DOAJ and Sherpa Romeo were first used to extract a preliminary list of publishers that supposedly possess at least 15 journals. Second, the publishers' websites were scraped to fetch their portfolios and, thus, their “true” journal counts.
Findings – The outcome is a list of the 100 largest publishers comprising 28,060 scholarly journals, with the largest publishing 3,763 journals, and the smallest carrying 76 titles. The usual “oligopoly” of major publishing companies leads the list, but it also contains 17 university presses from the Global South, and, surprisingly, 31 predatory publishers that together publish 4,606 journals.
Research limitations/implications – Additional data sources could be used to mitigate remaining biases; it is difficult to disambiguate publisher names and their imprints; and the dataset carries a non-uniform distribution, thus risking the omission of data points in the lower range.
Practical implications – The dataset can serve as a useful basis for comprehensive meta-scientific surveys on the publisher-level.
Originality/value – The catalogue can be deemed more inclusive and diverse than other ones because many of the publishers would have been overlooked if one had drawn from merely one or two sources. The list is freely accessible and invites regular updates. The approach used here (webscraping) has seldom been used in meta-scientific surveys.


Introduction
There is no complete and freely accessible catalogue of all scientific publishers and their journals. Since there may be tens of thousands of active publishers, a project that uses a sample of journals to assess meta-scientific trends could be content with analyzing only the largest publishers. This superlative can be defined by the yearly volume of paper outputs, by the annual profit margin, by the size of the publishing company, by the reputation among the academic community, or by the number of journals published. The present paper is interested in the last of these criteria; for, while publishers with high journal counts are believed to amount only to a tiny share of the scientific publication ecosystem, they are nevertheless assumed to process the vast majority of the scholarly output (Pollock, 2022, based on data from OpenAlex, cf. Priem et al., 2022). But how would one proceed to identify the, say, hundred largest academic publishers by their journal counts?
Bibliographic platforms may offer a first solution; but while there are indeed large databases of scientific outlets, they usually do not aim at comprehensiveness. As a result, their samples of publishers and their journal counts diverge significantly. Web of Science, for instance, is more exclusive than Scopus which, however, is likewise not all-inclusive (Mongeon and Paul-Hus, 2016); like other databases, it instead proclaims a set of criteria that are to be fulfilled before a publisher can have its journals indexed, leading to potentially large-scale omissions. Smaller catalogues follow specific rationales and thus do not intend to achieve an all-inclusive overview of the landscape of scientific publishing. The lists at the Directory of Open Access Journals (DOAJ) or the Open Access Scholarly Publishers Association (OASPA), for instance, only record journals and publishers that fulfil conditions pertaining to open access policies. Possibly the most comprehensive dataset, the one crawled by Google Scholar, may have harvested an impressive directory of publisher-level and journal-level data, but it is not available openly, thereby remaining invisible to the public (Harzing, 2014). Other options do not offer viable alternatives either; Sci-Hub (Himmelstein et al., 2018) does not transparently disclose its coverage source, and only comprises articles with a digital object identifier (DOI), although not all publishers necessarily use DOIs. CrossRef faces the same issue regarding DOIs, and adds to the difficulty by not listing "publishers", but rather "members" which may or may not overlap with the legal entity of a publisher. For instance, among the largest CrossRef members are Cairn, JSTOR, African Journals Online and others, all of which are not publishers themselves, but rather offer "digital library" platforms harbouring works from various sources pertaining to multiple publishers.
Browsing through the list of members already indicates that the share of non-publisher organizations is so large that filtering them would require an immense amount of detailed, manual labour [1]. The same issue of "over-inclusion" applies to Scilit.net, a database maintained by MDPI; it likewise includes Cairn or African Journals Online erroneously as "publishers". Other web platforms, such as JournalTOCs, exhibit the same issue, as they list SciELO, RMIT Publishing (Informit), Project MUSE, Sabinet Online, Redalyc, Erudit, Nepal Journals Online, or Bangladesh Journals Online among their (largest) publishers despite their character as data aggregators rather than actual publishers. The most promising development with regard to high-quality meta-scientific data, OpenAlex (Priem et al., 2022), is still under construction as of mid-2022; it remains to be seen how well the publisher-level data will be curated. Finally, Ulrichsweb remains a commercial database that is inaccessible to a broader audience, and even with a subscription, users cannot download a holistic catalogue of publishers and their journals; instead, the online platform only offers results based on specific user inputs. Using Ulrichsweb, one could obtain a glance at the largest publishers based on journal counts when one queries for active scholarly journals (the query would be […]). But this glance remains limited to the few dozen top options, and even this limited list contains multiple variations of publisher names (Figure 1). In brief, in searching for a list of the largest academic publishers by journal count, one will only encounter a heterogeneous, often incomplete blend of noisy and fragmentary numbers.
But without a near-complete catalogue of publishers and journals, any researcher risks omissions. An analyst who usually covers STEM (science, technology, engineering and math) disciplines may overlook, for example, the publisher Philosophy Documentation Center, which possesses 249 journals; a social scientist may not know of World Scientific despite its portfolio size of 204 journals; and a Western scientist may easily miss the Chinese company KeAi (with 130 journals) or the Indonesian press of Universitas Gadjah Mada (with 123 journals).
To fill this gap, a webscraping approach could aid in generating a list of major academic publishers as well as their journals. Due to coverage biases inherent to every platform, this approach should webscrape not just a single, but rather multiple research-related sources. The underlying rationale thus resembles a "Swiss cheese model", where a given layer (or platform) has various holes (or flaws and omissions), but if multiple layers are stacked together, losses can be prevented since the holes (or flaws and omissions) differ in their position. Accordingly, the project presented here first fetches data from four large research-related platforms to obtain a list of publishers that are supposed to be mid-sized or large according to each platform respectively. As a second step, it accesses each of these publishers' websites to scrape their journal count, so as to filter out only the largest publishers among the collected sample. The aim is thus to generate a catalogue of major academic publishers and their scholarly journals, a list that is supposed to be more comprehensive, accessible and inclusive than any of the existing ones, while still being focused only on publishers with voluminous portfolios (to reduce the data-collection burden). Moreover, the list should not merely offer a snapshot of a specific moment but be adaptable over time; keeping the data up-to-date is made possible by sharing the code publicly so as to enable extensions and reiterations of the webscraping process.
The following describes the methodical approach in greater detail. The section thereafter presents the results of the top 100 academic publishers, sorted by the number of serial titles they publish, with interesting findings regarding the relatively high shares of Global South university presses on the one hand, and of allegedly predatory publishers on the other hand. The discussion section then outlines various limitations encountered during the research process, including issues of data quality due to the non-uniform data distribution, or the difficulty of disambiguating imprints. The paper concludes with possible guidance on how the limitations nevertheless point towards future research paths so as to reach the wider goal of a complete overview of academic publishers and their scholarly journals that could serve as a starting point for broad meta-scientific investigations.

Methods
To generate a comprehensive list of academic publishers and their scholarly journals, two separate methodical steps were necessary. The first one comprised data collection on the publisher-level. Based on the preliminary results of that first step, the second one proceeded with gathering journal-level data, or at least the respective journal count. The following will describe the respective approach in sequence.
The data and the code are available in a Zenodo repository at https://doi.org/10.5281/zenodo.7081147 under a Creative Commons licence (CC0).
Publisher-level data
Data sample and data collection. A single data source seems insufficient when one seeks to attain a complete overview of the landscape of scholarly publications, for each source carries its own biases and indexing criteria. Instead, one should draw from multiple platforms. While heterogeneous in character and scope, they may, taken together, provide a more complete menu of publishers than if one merely used a single database.
The present project thus uses four data samples, each of which comprises not only a large list of academic publishers, but also (at least implicitly) the number of journals assigned to them.
The first one is Scopus, a large-scale database of scientific publications that provides an openly available source title list. Using their source list from October 2020 comprising 40,804 journals in total, the names of the publishers were extracted and their frequency (i.e. journal count) counted.
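The frequency count described for the Scopus source list can be illustrated with a minimal Python sketch. It is not the project's own code (which was written in R); the field name `publisher` and the sample rows are assumptions made purely for illustration.

```python
from collections import Counter


def journal_counts(source_rows):
    """Count how many source titles are listed under each publisher name.

    `source_rows` is an iterable of dicts with a hypothetical "publisher" key;
    rows with an empty or missing publisher are skipped.
    """
    names = (row["publisher"].strip() for row in source_rows if row.get("publisher"))
    return Counter(names)


# Hypothetical rows standing in for a source title list.
rows = [
    {"title": "Journal A", "publisher": "Publisher X"},
    {"title": "Journal B", "publisher": "Publisher X "},  # trailing space is stripped
    {"title": "Journal C", "publisher": "Publisher Y"},
    {"title": "Journal D", "publisher": ""},              # no publisher: ignored
]
counts = journal_counts(rows)
```

Each publisher name thereby receives a journal count equal to the number of titles listed under it in that one source.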
The second data sample, Publons, is a platform designed to document and verify peer reviews. It allows anyone to register a referee report conducted for any journal from any publisher (Van Noorden, 2014). It thus follows a "bottom-up" approach which potentially covers even publishers that tend to be invisibilized in other indexing services. Using webscraping with R's rvest library (Wickham and RStudio, 2020), this project accessed Publons' directory of publishers ("All publishers", n.d.).
The third source is DOAJ, a directory of open access journals aiming at a global coverage of scholarly publishers and journals that adhere to standards of open access publishing. To fetch the relevant information, this project used the JSON-formatted journal metadata from DOAJ's public data dump.
The final source of publishers used was Sherpa Romeo, a website which aggregates open access archiving policies from a growing number of more than 4,000 publishers. Their publisher list was scraped with R.
All these data were collected on 11 December 2020.
Data analysis. Having collected four datasets comprising publisher names and their number of journals according to each respective platform, this project joined these datasets together, harmonized some publisher names, and extracted the highest journal count per publisher. For example, if the publisher Copernicus Publications had 41 journals in Scopus, 47 in Publons, 40 in DOAJ, and 71 in Sherpa Romeo, that publisher was assigned the maximum journal count of 71. This count was only a preliminary one; the real number of journals would be verified later (as will be outlined below).
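The join-and-maximum step can be sketched as follows. This is an illustrative Python sketch, not the project's R code; the whitespace-and-case normalization in `harmonize` is an assumed, very crude stand-in for the name harmonization mentioned above. The Copernicus Publications figures are those given in the text.

```python
def harmonize(name):
    """Crude stand-in for name harmonization (assumption): collapse whitespace, ignore case."""
    return " ".join(name.split()).casefold()


def preliminary_counts(sources):
    """Merge per-platform counts and keep the highest count per publisher.

    `sources` maps a platform name to a dict of {publisher name: journal count}.
    Returns {harmonized publisher name: maximum journal count}.
    """
    best = {}
    for counts in sources.values():
        for publisher, n in counts.items():
            key = harmonize(publisher)
            best[key] = max(n, best.get(key, 0))
    return best


sources = {
    "Scopus": {"Copernicus Publications": 41},
    "Publons": {"Copernicus Publications": 47},
    "DOAJ": {"Copernicus Publications": 40},
    "Sherpa Romeo": {"Copernicus Publications": 71},
}
merged = preliminary_counts(sources)
```

Taking the maximum rather than, say, the mean reflects the idea that any single platform may undercount a publisher's portfolio.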
After garnering these data, the list was sorted by the preliminary number of journals in descending order. In total, there were 24,722 distinct publisher names. As resource constraints made it impossible to look at each publisher individually and thoroughly, a threshold was chosen that would leave one with a still-manageable sample while ensuring that the result would still be a plausible list of the largest publishers. With that threshold, only publishers that supposedly carried at least 15 titles according to any of the four data sources were kept; for example, since Copernicus Publications had been assigned the preliminary count of 71 journals (above the threshold of 15), it remained in the sample for further validation of its journal count. The threshold was chosen because it seemed low enough to ensure that all publishers that would make it into the final list would pass that threshold, even if the four data sources did not have a complete portfolio of these publishers; in this sense, the lower the threshold, the more complete will be the final data. However, the threshold should not be too low; it should rather be high enough to yield a sample that would be manageable for a manual verification of each publisher's journal count. In other words, as one lowers the threshold, the sample size increases, and thereby the likelihood of detecting yet another large publisher that will make it into the final list becomes greater. However, larger sample sizes require more resources, and there may be "a point where an effect [of increasing the sample size] becomes so minuscule that it is meaningless in a practical sense" (Alba-Fernández et al., 2020, p. 14). The threshold of 15 journals may have allowed for sufficient data to create a reliable top 100 list (cf. the superficial assessment in the Results section below).
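The thresholding and sorting described above amount to a simple filter. The following Python sketch illustrates it under stated assumptions: the two "Hypothetical" publishers and their counts are invented for the example, while the Copernicus Publications count and the threshold of 15 come from the text.

```python
THRESHOLD = 15  # minimum preliminary journal count, as chosen in the text


def shortlist(counts, threshold=THRESHOLD):
    """Keep publishers at or above the threshold, sorted by count in descending order.

    Ties are broken alphabetically so that the ordering is deterministic.
    """
    kept = [(name, n) for name, n in counts.items() if n >= threshold]
    return sorted(kept, key=lambda item: (-item[1], item[0]))


counts = {
    "Copernicus Publications": 71,
    "Hypothetical Mid Press": 15,   # exactly at the threshold: kept
    "Hypothetical Small Press": 9,  # below the threshold: dropped
}
sample = shortlist(counts)
```

Lowering `threshold` enlarges `sample`, which mirrors the trade-off between completeness and manual verification effort discussed above.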
Preliminary publisher-level results. A preliminary result extracted 568 distinct publisher names that supposedly published at least 15 journals, according to any of the four data sources DOAJ, Publons, Scopus or Sherpa Romeo.
This preliminary list was then cleaned manually, as there were obvious data quality issues such as inflated numbers and unharmonized publisher names. The manual refinement also got rid of duplications, discontinued presses and non-publishers (e.g. Egyptian Knowledge Bank or SciELO), resulting in a preliminary list of 414 academic publishers.

Journal-level data
Based on the preliminary list that resulted from the publisher-level data collection, the next step was to visit each listed publisher's website to find the respective portfolio of journals. In order to webscrape each publisher's respective journal list, the so-called CSS [2] selectors that harbour the names and the links of the journals were required. The manual collection of these CSS selectors for each of the 414 publishers was undertaken in January 2021 (and updated in mid-2022). The respective publisher websites were then scraped between March and July 2022, fetching data about journal names and journal counts [3], finally filtering the 100 largest publishers according to these webscraped journal counts.
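To illustrate what scraping a journal list via a selector involves, the following Python sketch extracts journal names and links from a small HTML snippet. It is only a rough approximation under stated assumptions: it uses the standard library's `html.parser` and matches `<a>` elements by class name as a stand-in for a CSS selector such as `a.journal-title`; the class name, the HTML snippet and the journal names are hypothetical, and the project itself used R rather than Python.

```python
from html.parser import HTMLParser


class JournalLinkParser(HTMLParser):
    """Collect (name, link) pairs from <a> elements carrying a given class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.journals = []    # finished (name, href) pairs
        self._inside = False  # True while a matching <a> is open
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and self.css_class in classes:
            self._inside, self._href, self._text = True, attrs.get("href"), []

    def handle_data(self, data):
        if self._inside:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._inside:
            name = " ".join("".join(self._text).split())  # normalize whitespace
            self.journals.append((name, self._href))
            self._inside = False


# Hypothetical portfolio page of a publisher.
page = """
<ul>
  <li><a class="journal-title" href="/journals/a">Journal of Examples</a></li>
  <li><a class="journal-title featured" href="/journals/b">Annals of <em>Samples</em></a></li>
  <li><a class="nav" href="/about">About us</a></li>
</ul>
"""
parser = JournalLinkParser("journal-title")
parser.feed(page)
parser.close()
journal_count = len(parser.journals)
```

Because every publisher website is structured differently, the selector (here, the class name) has to be identified separately for each site, which is why that part of the work was manual.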

Results
The data collection resulted in a catalogue of the 100 largest academic publishers (comprising 28,060 serial titles) based on journal counts. Summary statistics are visible in Table 1.
Ordered by journal counts, the top ones resemble the prominent "oligopoly" of academic publishing (Larivière et al., 2015): Springer, Taylor & Francis, Elsevier, Wiley, and SAGE lead the list. Many of the middle-ranging ones, however, may offer surprisingly unknown or only faintly familiar names to researchers whose usual range is confined to just a single, specific discipline or to a single, specific region.
Of the 100 largest publishers, 17 are university-based presses headquartered in research institutions in the Global South (perhaps surprisingly; cf. Collyer, 2018). Eight of them are from Latin America (cf. Delgado-Troncoso and Fischman, 2014), while seven are based in Indonesia (cf. Irawan et al., 2021; Wiryawan, 2014), including the largest among them, the Universitas Pendidikan Indonesia that publishes 177 journals. One press each from Iran and Malaysia rounds out this subset of Global South university presses.
Another possibly surprising result is that the list contains a large share of so-called predatory publishers, namely 31 out of 100 [4]. Most of the allegedly predatory publishers in the present list even publish more than one hundred titles; the largest one, OMICS, even has 705 journals in its portfolio, propelling it into sixth place in the overall ranking. In total, they publish 4,606 outlets, or almost 16.5% of all journals covered by the 100 publishers; roughly every sixth journal of a major publisher is a predatory one. Admittedly, the attribute of predatoriness is a contested one, but at its core, the term denotes organizations that publish seemingly scientific articles against monetary charges without offering an authentic peer review, while at the same time conducting dishonest practices such as deceiving the public with wrong impact factors, or listing researchers as editorial board members without their knowledge (Cobey et al., 2018, p. 8). Such (allegedly) predatory publishers are usually left out by curated databases for ethical reasons, but for comprehensive meta-scientific surveys, it may be useful to not exclude them.
The top 100, sorted by journal count, is visible in Table 2. Some of the publishers listed are not indexed in all four data sample platforms, meaning that they would have been overlooked if this project merely drew from one or two sources. This is especially the case for the so-called predatory publishers; for instance, OMICS (with 705 titles) was missing at both DOAJ and Sherpa Romeo; or, if one only used DOAJ and Scopus as relevant sources, then one would have omitted Gavin Publishers (with 168 journals) and Scientific and Academic Publishing (comprising 149 titles); and if one drew from just Publons and Scopus, then Open Access Pub (boasting 198 journals in its portfolio) would not have been found.
However, non-predatory publishers like university presses would have suffered a similar fate; for example, the press of Universitas Negeri Semarang which has 120 journals would not have been found if one merely collected publishers that had any reviews verified at Publons.
The "Swiss cheese model" approach of using various layers, or multiple research-related platforms for data-collection, thus helped to prevent potential data losses. This is not to claim that the result is exhaustive and accurate, as the Discussion section will consider below. There still may be omissions, especially in the lower ranks of the list: the distribution is so non-uniform that the upper "cloud" of the ranking is likely accurate, while the "tail" is rather noisy. To give a rough impression of how accurate the ranking is, at least with regard to the four data sources used here, one can slice the original sample (the unharmonized one comprising the 414 publishers that had at least 15 journals according to any of our four data sources) into ten deciles, with the tenth decile showing the largest publishers and the first decile the smallest ones. Each decile contains 41 or 42 publisher names. In the tenth decile, the vast majority of the publishers (87.8%) made it into the final top 100 list; in the ninth decile, that share fell to roughly a half (48.8%). The eighth decile was down to less than a fourth (22.0%). In general, there is a clear downward trend (with a few exceptions) until the first decile, which had just 2.4% of its publishers in the final list (see Table 3). With each decile, the median change was −7.1 percentage points, so that one could expect a further decile to have an even lower probability that any of the publishers listed there would make it into the final list. While such statistical numbers do not guarantee that the final top 100 list is accurate, they do provide confidence that the probability of errors is not overly high, at least given the four data sources here; and even if one demanded higher precision, the paper's purpose was primarily to demonstrate the utility of a method (webscraping) rather than to execute it to perfection.
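The decile check described above can be sketched in Python as follows. This is an illustrative sketch, not the project's code: the publisher names and preliminary counts are synthetic, and the assumption that 36 of the 41 publishers in the tenth decile reached the final list is chosen only because it reproduces the reported share of 87.8%.

```python
def decile_shares(preliminary, final_names):
    """Share of each decile's publishers that appear in the final list.

    `preliminary` is a list of (publisher, preliminary journal count) pairs;
    deciles are formed after sorting by count, smallest first, so the last
    element of the result corresponds to the tenth (largest) decile.
    Assumes at least ten publishers, so that no decile is empty.
    """
    ranked = sorted(preliminary, key=lambda item: item[1])
    n = len(ranked)
    bounds = [round(i * n / 10) for i in range(11)]  # near-equal slice borders
    shares = []
    for lo, hi in zip(bounds, bounds[1:]):
        chunk = ranked[lo:hi]
        hits = sum(1 for name, _ in chunk if name in final_names)
        shares.append(hits / len(chunk))
    return shares


# Synthetic sample of 414 publishers with strictly increasing counts.
preliminary = [(f"publisher_{i}", 15 + i) for i in range(414)]
# Assumed: the 36 largest publishers made it into the final list.
final_names = {f"publisher_{i}" for i in range(378, 414)}
shares = decile_shares(preliminary, final_names)
tenth_decile_pct = round(shares[-1] * 100, 1)
```

With 414 publishers, the slice borders yield deciles of 41 or 42 names each, matching the decile sizes stated in the text.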

Discussion
Webscraping first multiple databases of scientific indexing services, and then the publishers' websites themselves, offers an effective way to obtain a comprehensive overview of the landscape of academic publishing, at least when it comes to large publishers in terms of the number of journals in their portfolio. The present project utilized data from Scopus, Publons, DOAJ and Sherpa Romeo to automatically enumerate a list of major academic publishers and their scholarly journals that is as complete as possible. It first gathered a list of publishers that allegedly published at least 15 journals, before validating each publisher's journal count, which resulted in a catalogue of the 100 largest academic publishers comprising 28,060 scholarly periodicals.
Many of these publishers, especially in the mid- and smaller range, would have been omitted if one had drawn only from a subset of the databases. This is especially pertinent to those that are either located in the Global South (Collyer, 2018; Jimenez et al., in press, pp. 4-5; Okune et al., 2018; Teixeira da Silva et al., 2019) or that publish articles in languages other than English ("LOTE") (Ren and Rousseau, 2002; Vera-Baceta et al., 2019). They are not always indexed in the major scientific databases, and some of them do not issue DOIs for various reasons, making it easy to overlook them in conventional searches. Examples include the Iranian press of the University of Tehran (with 115 journals), the Chinese one of KeAi (130 journals), the major Indonesian players like the presses of Universitas Gadjah Mada (123 journals), Universitas Negeri Semarang (120 journals) and Universitas Diponegoro (87 journals), Eastern European publishers like the Editura Academiei Romane (76 journals), or Latin American entities belonging to the Universidade de Brasília (86 journals) or to the Universidad Nacional Autónoma de México (127 journals). The fact that the present project did not omit them indicates that the catalogue gathered here might be less susceptible to systemically biased omissions than if one had used merely one or two sources.
The list generated by this project thus offers a gateway towards large-scale analyses regarding macro-scale engagements, actions and policies of publishers and journals. Whether they relate to open access aspects, to the conduct of peer review, to article processing charges, to the availability of metadata or to editorial boards: whatever the use case, a webscraping approach that gathers meta-scientific information seems to offer a viable path for alternative and inclusive samples. And it is on the basis of these samples that one can thoroughly investigate existing research cultures in all their diversity.
In addition, as all of the present paper's code and data are shared publicly, they can be extended to cover further data sources, and the code may be executed repeatedly to update the catalogue over time.
However, there are various weaknesses and limitations to be discussed. First and foremost, while the upper "cloud" of the dataset may accurately depict the league of the largest academic publishers, the mid- and lower ranges (or "tail") may be more susceptible to noisy errors and omissions. In other words, the dataset is most likely an imbalanced one due to the non-uniform distribution of the underlying data (Kotsiantis et al., 2006). That is, there is a high probability of the largest publishers occurring in any of the four samples, but the smaller the publisher, the less likely it is that one identifies them through webscraping the four sources (a problem of undersampling). After all, the use of multiple platforms does not dispense with the necessity to be aware of inherent biases; it is possible that there are still publishers that have not made it into any of the four data samples used for this project. Such biases could be mitigated by drawing from more and more sources. CORE (Makhija et al., 2018), JSTOR (Schonfeld, 2012), BASE (Pieper and Summann, 2006), OpenAIRE Explore (Alexiou et al., 2016), the Directory of Free Arab Journals (DFAJ) (2021), SciELO (Packer, 2009), the Iranian Scientific Information Database (SID.ir), or African Journals OnLine (AJOL) may serve as likely candidates, though one would first need to ensure that one can indeed obtain structured data from them.
Other data difficulties remain. The issue of disambiguating publisher names and their imprints is one that may lead to arbitrary definitions (e.g. differentiating Springer from Springer Nature and BioMedCentral, but not from Demos Medical Publishing, even though they all share the same parent companies). A related problem arises when the samples erroneously list aggregators or information retrieval platforms (such as SciELO or the Egyptian Knowledge Bank) as publishers. This is one reason why CrossRef's member list or Scilit could not be used as data sources for the present project. A further limitation lies in the fact that some of the journals listed in the publishers' online catalogues may be discontinued or inactive (Cortegiani et al., 2020). The next step should thus necessarily entail a closer and possibly manual assessment of each publisher's precise journal count.
Once these limitations are addressed, the webscraping approach outlined here may fill a gap in the meta-scientific literature, especially with regard to exhaustive surveys of university presses, scholarly publishers and scientific journals. Without a reliable and freely available comprehensive list, scientometric examinations would risk an incomplete coverage of the diverse landscape of academic publishing, leading to a structural invisibilisation of underrepresented journals or an underestimation of the extent to which predatory publishers have occupied the scientific ecosystem.
With additional data refinements and even more encompassing, alternative sources, the list may finally attain a satisfying degree of saturation and accuracy. Once one can be certain that there is a complete and inclusive catalogue of academic publishers and scholarly journals from all around the world without any blind spots, this cannot but benefit the whole science of science.
2. Cascading Style Sheets, a computer language used for laying out and structuring websites (usually in conjunction with HTML, or Hypertext Markup Language).