Labour market segmentation and the gender wage gap in Spain

Purpose – The purpose of this study is to analyse the gender wage gap (GWG) in Spain adopting a labour market segmentation approach. Once we obtain the different labour segments (or idiosyncratic labour markets), we are able to decompose the GWG into its observed and unobserved heterogeneity components. Design/methodology/approach – We use the data from the Continuous Sample of Working Lives for the year 2021 (matched employer-employee [EE] data). Contingency tables and clustering techniques are applied to employment data to identify idiosyncratic labour markets where men and/or women of different ages tend to match/associate with different sectors of activity and occupation groups. Once this “heatmap” of labour associations is known, we can analyse its hottest areas (the idiosyncratic labour markets) from the perspective of wage discrimination by gender (Oaxaca-Blinder model). Findings – In Spain, in general, men are paid more than women, and this is not always justified by their respective attributes. Among our results, the fact stands out that women tend to move to those idiosyncratic markets (biclusters) where the GWG (in favour of men) is smaller. Research limitations/implications – It has not been possible to obtain remuneration data by job-placement, but an annual EE relationship is used. Future research should attempt to analyse the GWG across the wage distribution in the different idiosyncratic markets. Practical implications – Our combination of methodologies can be adapted to other economies and variables and provides detailed information on the labour-matching process and gender wage discrimination in segmented labour markets. Social implications – Our contribution is very important for labour market policies, trying to reduce unfair inequalities. Originality/value – The study of the GWG from a novel labour segmentation perspective can be interesting for other


Introduction
When economists wonder why men receive higher wages than women, or try to analyse how this wage gap evolves over time or behaves within the wage distribution itself, they usually resort to the so-called decomposition methods. The pioneers of this methodology are Oaxaca (1973) and Blinder (1973); both papers focus on the issue of wage discrimination. The basic idea of the Oaxaca-Blinder (OB) decomposition method consists of answering the following two questions: (1) How large is the part of the gender wage gap (GWG) that can be attributed to gender differences in those characteristics that are relevant to wages? This portion (due to the endowments of each group) is called the "explained" component of the gap. (2) How large is the part of the GWG that is due to differences in how those relevant characteristics are rewarded in the labour market for men and women? This second part of the gap (due to the coefficients/returns of each group) is called the "unexplained" component. In the context of the GWG, this second component is often interpreted, at least partially, as "discrimination" (Jann, 2008), though there is not always discrimination behind it.
Although the Spanish labour market is showing a positive evolution after the COVID-19 pandemic, with 21 million employed people and an unemployment rate of 11.6% in the second quarter of 2023, the truth is that it continues to be a problematic market characterised by elevated long-term unemployment, high youth unemployment, strong segmentation, low regional and occupational mobility and wage inequality between men and women. To address the issue of gender pay inequality, in this paper, we combine the OB decomposition technique with the empirical framework of clustered contingency tables (CTs). We use labour-matching data from a large database of administrative records, the Continuous Sample of Working Lives (Muestra Continua de Vidas Laborales, MCVL), and structure them into a clustered CT that cross-classifies the information on gender and age of workers and occupation group and activity sector of job placements. The clustered CT allows us to represent the labour market in a segmented way. From this table, we can obtain a heatmap of the labour market that shows how workers, depending on their gender and age, tend to associate with different occupation groups and sectors of activity in the labour market. This heatmap allows us to identify idiosyncratic labour markets of men or women or both where pay inequalities can be very different. In our opinion, the problem of wage inequality cannot be approached as a problem of the labour market as a whole but has to be analysed in a segmented way, looking at each labour market segment. This segmented vision of labour matching in Spain will help to better design measures against the GWG.
There are several theoretical models that can offer support to our empirical analysis, such as the theories of labour market segmentation; the two-sided matching models (Gale and Shapley, 1962; Roth and Sotomayor, 1992), where the occupations and activities chosen by men and women would fundamentally depend on their respective preferences; or the models where companies have sufficient market power to allow themselves to discriminate against certain groups of workers. In this last line, we highlight Becker's (1971) "discrimination of taste" model. Becker (1971) stated that an aversion felt by employers toward persons belonging to certain groups might constitute a source of discrimination and lead to lower wages for discriminated workers. He presented this hypothesis in formal terms by assuming that the gains these employers derive from employing workers include the profit of the firm and some taste parameters. However, such discrimination cannot persist under perfect competition, as employers with no preference will drive employers with discriminatory preferences out of the market by offering all workers equal wages. Hence, the presence of imperfect competition in the labour market is necessary to explain the existence and persistence of discrimination. In this sense, limitations on personal mobility permit firms to exercise monopsonistic power and to pay workers with identical productive abilities differently (Cahuc et al., 2014, Ch. 8). This low-mobility argument has been used to explain discrimination against women and certain ethnic minorities (see, for example, Gordon and Morton, 1974; Barth and Dale-Olsen, 2009). Our research hypothesis is that the mobility limitations of workers between occupation groups and/or sectors of activity in Spain may be giving companies market power in certain labour market segments and the opportunity to discriminate against workers according to their preferences.
There is literature that uses decomposition methods to analyse the GWG in Spain (and other countries). For instance, Hidalgo (2010) uses quantile regression to simulate counterfactual densities and decompose the evolution of Spanish wage inequality (period 1980-2000) into changes due to coefficients, endowments and non-observable worker characteristics (following the methodologies of Machado and Mata, 2005; Autor et al., 2006, 2008). The data come from the Household Budget Survey and the MCVL. According to this author, wage inequality follows a counter-cyclical trend from the mid-eighties onwards, and changes in both coefficients and endowments play an important role in this evolution. Guner et al. (2014) observed a GWG in Spain of around 20% in 2010, a figure quite close to its value in 1994. The authors use data from the Spanish Labour Force Survey (Encuesta de Población Activa, EPA) from 1977 to 2013. Using two decomposition methods (OB and decomposition using quantile regression) and considering the problem of sample selection bias, they observe that the GWG is driven mainly by differences in returns to individual characteristics: women are more qualified than men in observable labour market characteristics but earn less. The same techniques are implemented by Dueñas and Moreno (2018), who analysed the GWG in the Spanish, French and German labour markets in 2015 using microdata from the EU Statistics on Income and Living Conditions (EU-SILC, 2016). The results obtained indicate that Spain is the country with the lowest wage gap and the biggest wage discrimination, Germany being the country with the biggest wage gap and the lowest wage discrimination and France in an intermediate position. Thus, in the case of Spain, practically the whole GWG is due to the unexplained part of the OB model (98.29%); this percentage is lower in the cases of France (72.84%) and Germany (58.13%). For their part, Murillo-Huertas et al. (2017) examine regional differences in the GWG in Spain using matched employer-employee (EE) microdata: the 2002, 2006 and 2010 waves of the Survey of Earnings Structure (Encuesta de Estructura Salarial, EES). Their findings suggest that Spain shows a significant regional heterogeneity in the size of the raw gap. Their OB decomposition analysis shows that although the bulk of the GWG in Spanish regions is due to differences in the endowments of productive characteristics between males and females, there is still a substantial part of the gender gap that remains unexplained.
There is also literature for the Spanish economy on wage differentials between groups or collectives other than men and women but where the gender variable plays an important role as a control variable. For example, by adopting a regional perspective, García and Molina (2002) show that being a man reduces the wage gap between the different regions analysed (North, East, South or Centre of Spain) and Madrid. Such a discriminatory effect is higher in the North, East and South than in the Centre of Spain; on interregional wage differentials in Spain, see also Murillo-Huertas et al. (2020). In the field of labour insertion, we can highlight the paper of Arrazola et al. (2022). These authors show that there are gender differences in the labour insertion process of recent Spanish graduates. These differences are, in general, systematically negative for women and are especially important for the salaries received and the type of contract (part-time/full-time, temporary/permanent contract), although they also depend on the branch of knowledge of the studies; for example, in the engineering branch, the gender gap in the probability of having a relatively high salary (which is unfavourable to female graduates) is explained almost entirely by unobservable institutional or socioeconomic factors, i.e. by the unexplained part of the probability gap. Another wage gap analysed for the Spanish economy is the one that arises when comparing wages in the private sector and the public sector. For instance, this gap is analysed by Couceiro de León and Dolado (2023). These authors find that those unobserved female characteristics which increase the probability of working in the public sector have a favourable impact on wages, but this public wage premium is only observed for low-educated women. These findings are in part consistent with those of Hospido and Moral-Benito (2016), who find positive selection towards the public sector at the bottom of the wage distribution; in this field, see also the study of Antón and Muñoz de Bustillo (2015).
From the reviewed literature, at least three conclusions can be drawn: (1) It is evident that there is a wage gap unfavourable to women in the Spanish labour market and that a significant part of this gap is due to unobserved factors that affect either the constant of the Mincer wage equation or the return of its explanatory variables. (2) The use of the Spanish MCVL in GWG analysis is not common, even though these data contain remuneration information at the individual level and a wide range of explanatory variables for that remuneration. (3) The reviewed literature clearly shows that the seminal methodological contributions of Oaxaca (1973) and Blinder (1973) have given way to a broad set of methodological advances that allow decomposition methods to adapt to different data structures and research questions; see the survey by Fortin et al. (2011). Among other improvements in the decomposition methodology, we can mention the following: incorporation of standard errors and confidence intervals into the estimates; contributions of single covariates to the explained and unexplained components of the gap; treatment of dummy variables as explanatory variables; decomposition of discrete choice models; correction of sample selection bias (Heckman, 1979); decomposition of differences in mean outcome differentials (Smith and Welch, 1989; Juhn et al., 1991); combination of decomposition and matching techniques (Ñopo, 2008); gap decomposition along the wage distribution using quantile information (Machado and Mata, 2005; Melly, 2005; Firpo et al., 2018); and decomposition for panel data and mixed models (Smith and Welch, 1989; Kim, 2010; Kröger and Hartmann, 2021).
Underlying many of these methodological contributions, in one way or another, is the segmentation of the population (or the sample) analysed. Some studies approach segmentation exogenously (according to external classifications), for example, segmenting the sample by earning quantiles (Juhn et al., 1993), or analysing the GWG for different groups in the labour market (by race, region, period of time, public or private job, etc.), leading to an analysis of the gap between the gender gaps of the groups analysed (Smith and Welch, 1989; Juhn et al., 1991). Other authors approach segmentation endogenously (i.e. using information from the sample/population to make the segmentation), for example, admitting that there are workers with different probabilities of accessing a job and, therefore, a salary (Heckman, 1979), or looking for groups of men and women which have comparable characteristics, giving rise to the OB analysis of the common support of the distributions of observable characteristics (Ñopo, 2008).
To our knowledge, none of the proposed segmentations control for the existence of labour segments where the propensity of young/older men/women to be matched to jobs may differ significantly. Our labour market segments are based not directly on spatial criteria (as, for example, in Manning and Petrongolo, 2017) but on how men and women of different age groups are matched with the different sectors of activity and occupation groups existing in the labour market. Combining CT and clustering methodologies, we can identify labour segments where men (or women) of certain ages show a high propensity to match/associate with certain sectors of activity and occupation groups; on this methodology, see the works of Alvarez de Toledo et al. (2018, 2020). The fact that young/older men/women show different propensities to match up with certain activity sectors and occupation groups can be due to three reasons: (1) their respective preferences when looking for a job, (2) the preferences of companies when hiring them and (3) their search patterns and those of the companies that hire them (geographical search areas, search channels used, etc.).
Our statistical segmentation procedure is guided by an economic criterion: who matches with whom in the labour market. We want to relate the wage gap of each labour segment (or idiosyncratic market) with the propensity to match/associate men and women of different ages with the jobs offered in that segment. In principle, it would be expected that in those labour segments where the wage gap is more favourable to young (older) men, the propensity of young (older) women to match is lower and vice versa.
The rest of the paper is structured as follows: Section 2 presents our labour market segmentation methodology based on clustered CTs. Section 3 describes the labour-matching data used (MCVL). Section 4 offers the results of estimating a wage equation and decomposing the GWG adopting a segmented vision of the Spanish labour market. Finally, Section 5 shows the general conclusions of our work.

Labour market segmentation methodology
The use of CTs makes sense when there are categorical variables that provide relevant information about a phenomenon under study (Mosteller, 1968; Agresti, 2013). Each cell of the CT shows the frequency of a particular combination of categories of the different categorical variables that are cross-represented in it. From this starting table, correspondence (association) and clustering analyses can be carried out to obtain a new table containing the degree of association between the different categories of the categorical variables; we will call this table the heatmap.
The heatmap proposed in this study allows us to characterise the employment episodes of the MCVL by measuring the degree of association between the different categories of four categorical variables that have an important weight when it comes to segmenting the labour market; namely, gender and age group of the worker and occupation group and activity sector (industry) of the job. To obtain this heatmap, the first step is to cross-classify these four variables in a CT where the rows represent the combinations of the categories of the occupation and industry variables, and the columns represent the combinations of the categories of the gender and age group variables (see Table 1). Our economic hypothesis is that there are certain rows and columns of this CT which tend to be strongly associated (they tend to appear together in the CT) and this should be reflected in the heatmap; in other words, certain young/older men/women tend to match with job vacancies belonging to certain occupational groups and activity sectors. When we refer to strong association, we do not necessarily mean that a particular combination of categories of the four variables considered has a relatively high frequency in the CT, but that this frequency is higher than the one expected if the generation of category combinations were totally random.
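The cross-classification step described above can be sketched in a few lines of Python. This is only an illustration: the record layout, the category labels and the helper name `build_contingency_table` are hypothetical and are not taken from the MCVL.

```python
from collections import Counter

# Hypothetical employment episodes: (gender, age_group, occupation, sector).
# The labels are illustrative only; they are not the MCVL category codes.
episodes = [
    ("F", "25-34", "administrative", "trade"),
    ("F", "25-34", "administrative", "trade"),
    ("M", "45-54", "unqualified", "construction"),
    ("M", "25-34", "administrative", "trade"),
    ("F", "45-54", "unqualified", "health"),
]

def build_contingency_table(records):
    """Cross-classify episodes into a two-way table.

    Rows are (occupation, sector) combinations and columns are
    (gender, age group) combinations, following the layout in the text.
    """
    counts = Counter(((occ, sec), (gen, age)) for gen, age, occ, sec in records)
    rows = sorted({row for row, _ in counts})
    cols = sorted({col for _, col in counts})
    table = [[counts.get((row, col), 0) for col in cols] for row in rows]
    return rows, cols, table

rows, cols, table = build_contingency_table(episodes)
for label, freqs in zip(rows, table):
    print(label, freqs)
```

Each printed row is one job segment with its frequencies across the worker segments. Only the combinations that occur in the toy data appear here, whereas the table in the paper covers every combination of categories.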
By crossing 10 occupation groups and 10 activity sectors, 100 rows are generated in the CT and heatmap. By crossing 2 gender categories with 8 age groups, 16 columns are generated in the CT and heatmap. Therefore, these two tables have a total of 1,600 cells (16 rows × 100 columns). This high number of cells makes it difficult to identify association patterns in the heatmap of the labour market. To overcome this problem, we smooth the observed CT and apply a clustering procedure to its two sides; in this way, the analysis focuses on homogeneous groups of rows and columns rather than on individual rows and columns. The procedure to obtain the smoothed ordered (clustered) CT and the corresponding heatmap can be seen in Alvarez de Toledo et al. (2018, 2020). Here we summarise it in the following steps:
1. The observed CT is smoothed, resulting in Table 1. Smoothing techniques in the CT framework provide solutions for estimating cell frequencies (and their probabilities) in the presence of "sparsity." When in a CT the number of cells is too high and/or the finite sample is too small, some cells with positive occurrence probabilities can be zero or have a very small frequency; this phenomenon is known as a zero frequency problem or sparsity problem. In this scenario of sparse high-dimensional CTs, multivariate statistical analyses (as, for example, correspondence analyses, association factors or χ² tests of independence) may lose the optimal properties that they show in larger samples.
2. From the smoothed and unclustered CT (Table 1), two auxiliary tables are generated that collect, respectively, the observed and expected probabilities of having an employment episode in each cell; note that these two probabilities are calculated on the already smoothed CT. The observed probability in each cell $(i, j)$ is the quotient $n_{ij}/n$, while the expected one is obtained as the product of the row marginal probability $n_{i+}/n$ and the column marginal probability $n_{+j}/n$ corresponding to that cell.

3. The quotient of both auxiliary tables yields a table of association factors ($a_{ij}$) between rows and columns [1]:

$$a_{ij} = \frac{n_{ij}/n}{(n_{i+}/n)\,(n_{+j}/n)}$$
Factor values higher than one would mean that the association between the corresponding row and column of the table is greater than in a random assignment scenario and vice versa. As an example, Figure 1 represents an association table with an arbitrary order of six rows and six columns. For better visualisation, we have coloured the cells according to the values of the association factor: the higher the factor, the darker the cell.
4. The CT is clustered on the row side and on the column side. The hierarchical (average linkage) clustering methodology is based on a similarity measure between the elements that are clustered (row categories or column categories). We measure the similarity between each pair of rows of the CT ($i_A$ and $i_B$) as the overlap, or percentage of coincidence, of their respective row profiles. This measure of similarity lies by definition between 0 and 1 and can be calculated in an analogous way on the column side; i.e. two column categories of the CT are more similar the more they resemble each other in the way they are matched with the row categories.

Figure 1. Unclustered association table
5. The clustering process on both sides of the CT gives rise to separate row and column dendrograms. The dendrogram graphically shows how the row (or column) categories are joined sequentially to give rise to homogeneous groups or clusters of categories. By definition, the base of each dendrogram (the one with the rows and the one with the columns) places the respective clustered categories by proximity. So, these respective bases can be used to order the rows and columns of the association table, giving rise to a clustered association table or heatmap (or "gravity" map) that makes it possible to get a panoramic view of how certain groups/clusters of rows tend to be associated with certain groups/clusters of columns and vice versa. Figure 2 shows the association table from Figure 1 after it has been sorted using the information of the clustering process. As can be seen, this figure shows the existence of row clusters and column clusters. For example, rows $i_1$, $i_6$ and $i_4$ form a row cluster because they are similar in the way they are associated with the columns of the table, and columns $j_1$ and $j_2$ form a column cluster because they resemble each other in the way they are associated with the table rows.
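The five steps above can be illustrated with a minimal, standard-library Python sketch. Several choices here are assumptions rather than the authors' procedure: the smoothing is a simple additive pseudo-count (`alpha` is a hypothetical parameter), the overlap of two profiles is taken to be the sum of their cell-wise minima, and the dendrogram base is approximated by concatenating clusters in the order in which they merge. The toy counts are invented.

```python
def smooth(table, alpha=0.5):
    """Additive smoothing: add a pseudo-count alpha to every cell (assumed method)."""
    return [[x + alpha for x in row] for row in table]

def association_factors(table):
    """a_ij = (n_ij / n) / ((n_i+ / n) * (n_+j / n)); needs positive margins."""
    n = sum(map(sum, table))
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    return [[(x / n) / ((row_tot[i] / n) * (col_tot[j] / n))
             for j, x in enumerate(row)] for i, row in enumerate(table)]

def profiles(table):
    """Row profiles: each row divided by its total, so that it sums to 1."""
    return [[x / sum(row) for x in row] for row in table]

def overlap(p, q):
    """Assumed similarity: sum of cell-wise minima of two profiles, in [0, 1]."""
    return sum(min(a, b) for a, b in zip(p, q))

def average_linkage_order(prof):
    """Agglomerative clustering with average linkage on the overlap similarity.

    Returns the items in the order produced by successive merges, a simple
    stand-in for the base of the dendrogram.
    """
    clusters = [[i] for i in range(len(prof))]

    def similarity(a, b):
        total = sum(overlap(prof[i], prof[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        pairs = [(a, b) for k, a in enumerate(clusters) for b in clusters[k + 1:]]
        a, b = max(pairs, key=lambda pair: similarity(*pair))
        clusters = [c for c in clusters if c is not a and c is not b] + [a + b]
    return clusters[0]

# Hypothetical 4 x 4 table of counts with two hidden blocks.
counts = [
    [30, 2, 28, 1],
    [1, 25, 2, 27],
    [29, 1, 31, 2],
    [2, 26, 1, 24],
]
smoothed = smooth(counts)
factors = association_factors(smoothed)
row_order = average_linkage_order(profiles(smoothed))
col_order = average_linkage_order(profiles(list(zip(*smoothed))))
heatmap = [[factors[i][j] for j in col_order] for i in row_order]
for i, row in zip(row_order, heatmap):
    print(i, [round(a, 2) for a in row])
```

After reordering, rows 0 and 2 (and columns 0 and 2) sit next to each other, so the cells with factors above one form two dark blocks. These blocks play the role of the biclusters in the clustered association table.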
Our segmentation scheme is not incompatible with the existence of occupational/sectoral mobility in the labour market; in fact, the existence of certain mobility between nearby occupations and/or activities may be favouring the formation of the clusters that we observe in the heatmap (for example, the mobility of older men between the agriculture and construction sectors in certain regions of Spain). It is also true that if disruptive changes in mobility patterns were observed, the heatmap would change its shape, but this type of change occurs slowly because it requires the retraining of workers. In any case, there is evidence of low occupational and sectoral mobility in Spain, which gives our heatmap stability in the short term; see, for example, Anghel et al.

It can happen that a cluster of rows tends to be associated with a cluster of columns and vice versa. This case is called a bicluster (or labour market segment) and can be explored for idiosyncratic features; three possible biclusters have been marked with red borders in Figure 2. For example, we could look inside a particular bicluster to analyse its structure by gender and age of the worker, region of the workplace, activity sector and occupation group of the job placement, worker earnings, etc. In this study, we will focus on analysing the wage gap by gender.

The GWG can be analysed by following the OB decomposition. This statistical method is used to analyse the differences in mean outcomes between two groups. As can be seen in Eq. (1), the OB decomposition of the wage differential helps researchers to understand the extent to which differences in observed characteristics ("endowments" or explained factors) and differences in the returns of those characteristics ("coefficients" or unexplained factors, including discrimination) contribute to the overall GWG. For example, we can measure whether men earn more than women because they have more experience in the labour market or because the same level of experience is more highly valued in the case of men. For its part, the "interaction" component shows a simultaneous effect of differences in endowments and coefficients, and it usually presents a small or negligible effect on the explained differential.
where m: male; f: female; $W$: average wage; $k$: explanatory variables (excluding the intercepts $\beta_{0f}$ and $\beta_{0m}$); $X_k$: average values of the explanatory variables; and $\beta_k$: estimated Mincer coefficients.
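With the symbols just defined, a threefold decomposition of the kind described (endowments, coefficients and interaction) is commonly written as follows. This is a sketch of the standard form, assuming that women's averages and coefficients act as the reference, which is the reading suggested by the interpretation of the coefficient effects later in the text; it is not necessarily the exact layout of Eq. (1).

```latex
W_m - W_f
  = \underbrace{\sum_{k} (X_{km} - X_{kf})\,\beta_{kf}}_{\text{endowments}}
  + \underbrace{(\beta_{0m} - \beta_{0f}) + \sum_{k} X_{kf}\,(\beta_{km} - \beta_{kf})}_{\text{coefficients}}
  + \underbrace{\sum_{k} (X_{km} - X_{kf})\,(\beta_{km} - \beta_{kf})}_{\text{interaction}}
```

Expanding the three sums gives $(\beta_{0m} + \sum_k X_{km}\beta_{km}) - (\beta_{0f} + \sum_k X_{kf}\beta_{kf})$, i.e. the difference between the mean wages predicted by the two group-specific equations.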
Note that the Mincer equation, estimated separately for men and women, provides the respective coefficients {$\beta_{0m}$; $\beta_{0f}$; $\beta_{km}$; $\beta_{kf}$} that are used in the OB decomposition of the GWG. A novel aspect of our analysis is that the effect of the association factor ($a_{ij}$) can be considered in the wage equations and thus in the OB decomposition. As previously mentioned, the association factor is calculated in this study on a CT where rows (i) represent worker segments defined by the age group and gender of the employee, and columns (j) represent job segments defined by the occupation group and activity sector of the job position. Under this definition of segmentation, the estimated OB coefficient for $a_{ij}$ lets us know who benefits most from showing a higher association (or dependence) with jobs belonging to certain occupations and activities: young or older men or women. Additionally, the formation of labour biclusters (darker areas of the heatmap) allows a segmented analysis of the wage differential. Indeed, we can analyse the GWG in those segments of the labour market where women or men (or both) tend to go (given their preferences in the labour-matching process). This segmentation analysis allows us to know both the relative situation of women in different segments of the labour market and whether their preferences in labour matching are related to going to those labour segments where the wage gap is less unfavourable for them.
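To make the mechanics concrete, the following Python sketch computes a threefold decomposition from group means and group-specific coefficients. It assumes women's means and coefficients as the reference, and every number in it is invented for illustration (including the two covariates, loosely thought of as log days worked and the log association factor); none of the values come from Tables 2 or A3.

```python
def oaxaca_blinder_threefold(x_m, x_f, b_m, b_f, b0_m, b0_f):
    """Threefold decomposition of a mean log-wage gap.

    x_m, x_f: covariate means for men and women.
    b_m, b_f: slope coefficients of the two group-specific wage equations.
    b0_m, b0_f: intercepts. Women are the reference group (an assumption).
    """
    endowments = sum((xm - xf) * bf for xm, xf, bf in zip(x_m, x_f, b_f))
    coefficients = (b0_m - b0_f) + sum(
        xf * (bm - bf) for xf, bm, bf in zip(x_f, b_m, b_f))
    interaction = sum(
        (xm - xf) * (bm - bf) for xm, xf, bm, bf in zip(x_m, x_f, b_m, b_f))
    return endowments, coefficients, interaction

# Hypothetical inputs: two covariates with made-up means and coefficients.
x_m, x_f = [5.60, 0.20], [5.50, 0.25]
b_m, b_f = [0.90, 0.05], [0.88, 0.09]
b0_m, b0_f = 4.00, 4.10

endow, coef, inter = oaxaca_blinder_threefold(x_m, x_f, b_m, b_f, b0_m, b0_f)
gap = (b0_m + sum(x * b for x, b in zip(x_m, b_m))) \
    - (b0_f + sum(x * b for x, b in zip(x_f, b_f)))
print(f"gap={gap:.4f} endowments={endow:.4f} "
      f"coefficients={coef:.4f} interaction={inter:.4f}")
```

By construction, the three components add up to the gap between the mean wages predicted by the two equations, which is a useful sanity check on any implementation.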

Data description
The MCVL is a set of individual microdata extracted from the Spanish Social Security records. The Social Security information is completed with tax information from the State Agency for Tax Administration (Agencia Estatal de Administración Tributaria, AEAT) and with information from the Continuous Register provided by the National Statistics Institute (Instituto Nacional de Estadística, INE). This database offers annual information on more than a million people who appear each year in the Social Security records as recipients of income from work, subsidies or pensions. To build the sample, 4% of the population registered in Social Security in a certain year are selected through a simple random sampling system; therefore, the MCVL is representative only of the population that is related to Social Security in the reference year; note that our analysis is cross-sectional, for the year 2021.
The MCVL is suitable for our research because it allows us to estimate, cross-sectionally (for the year 2021), the wage (Mincer) equation that explains the individual's wage through a series of variables that describe attributes of the worker, the firm and the job position that he/she occupies. In the year 2021, the MCVL contains 649,893 workers, 247,822 employers (private companies or public administrations) and 1,265,406 employment episodes (labour contracts); we only consider the employment episodes for which the information of all the variables that are going to be used in the econometric analysis is known. These episodes correspond to workers who already had a job at the beginning of the year or who found one during the year.
The wage information comes from the tax module of the MCVL, which allows us to obtain the annual income from work (whether in cash or in-kind) for each employee-employer combination (hereinafter EE) observed. Figure 3 shows the density function by gender of the annual income of the EE relationships. As can be observed, women show a higher density in intermediate wages, while men do so at higher wages. The average wage of women is €13,450 (sd = €14,214, median = €9,076), while that of men is €15,404.03 (sd = €15,551, median = €11,499). There is, therefore, an average wage gap favourable to men in the Spanish economy.
The (continuous or categorical) variables that will be used in the estimation of the wage equation and in the OB decomposition of the GWG are listed in Table A1, in the Annex. For its part, Table A2 (also located in the Annex) shows the employment distribution of those variables in Table A1 that produce the CT used to segment the job matching process; namely, the gender and age group of the worker and the occupation group and activity sector of the job position; descriptions of the rest of the variables in Table A1 are available as supplementary material.
As shown in Table A2, the most frequent categories of each variable are the male gender (52.48%); the younger and intermediate age groups; the occupation groups of officers and specialists, unqualified workers and administrative assistants; and the services in the private sector, especially trade, hotels and restaurants, transport and communications and other services [2].

In this section, two econometric models are estimated, one that tries to explain the wages of workers (differentiating between men and women) and another that tries to decompose the GWG. Since the wage is not available by employment episode, but by EE combination (annual income from each payer, regardless of the number of contracts that the worker has had with that payer), the initial database of 1,265,406 employment episodes is restructured in terms of EE combinations; these combinations, 832,985 in total, constitute our sample units in the econometric models. As an EE relationship may have given rise to several employment episodes (labour contracts) in 2021, those explanatory variables directly related to the job placement are processed in the econometric models so that they are approximately representative of the annual EE relationship; these contract-dependent variables are the occupation group, the activity sector, the province of the work centre, the type of employment relationship and the type of contract. Specifically, for these five variables, we have considered in each EE relationship the category of the variable in which the worker has accumulated the longest duration during 2021 [3].

Table A3 (Annex) shows the output of the wage equation estimation (wage in logarithms). We have estimated a model for men and women considered together ("Men and Women" model in the table), a model only for men ("Men" model) and a model only for women ("Women" model). Note that, unlike other existing estimates in the literature, we have incorporated into the estimates the association factor $a_{ij}$ {i = gender and age group; j = occupation group and sector of activity} corresponding to each EE sample unit.
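The restructuring of employment episodes into EE combinations described above, where each contract-dependent variable takes the category with the longest accumulated duration, can be sketched as follows. The record layout, identifiers and the helper name `dominant_category` are hypothetical; the sketch handles a single variable (occupation) and ignores ties.

```python
from collections import defaultdict

# Hypothetical episodes: (worker_id, employer_id, occupation, days_in_2021).
episodes = [
    ("w1", "e1", "administrative", 90),
    ("w1", "e1", "unqualified", 60),
    ("w1", "e1", "unqualified", 50),
    ("w1", "e2", "technician", 120),
    ("w2", "e1", "unqualified", 365),
]

def dominant_category(records):
    """For each (worker, employer) pair, return the category in which the
    worker accumulated the most days across that pair's episodes."""
    days = defaultdict(lambda: defaultdict(int))
    for worker, employer, category, duration in records:
        days[(worker, employer)][category] += duration
    return {pair: max(by_cat, key=by_cat.get) for pair, by_cat in days.items()}

ee_occupation = dominant_category(episodes)
for pair, category in sorted(ee_occupation.items()):
    print(pair, category)
```

In the toy data, worker `w1` has three contracts with employer `e1`; the two "unqualified" spells add up to 110 days and therefore outweigh the single 90-day "administrative" spell.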
Table A3 shows standard results in this area, with the best-paid workers being male, of intermediate age and longer duration in the company (in 2021), belonging to the public sector, with high educational or occupational levels and located in certain sectors of activity, such as financial and business services, supplies and extractive and manufacturing industries. Moreover, the elasticity of the annual income to the association factor of the labour segment {gender, age group; occupation, industry} is positive in the three estimates. This means that higher wages are expected for those workers who are in labour segments where workers of their age and gender group tend to be associated with the corresponding occupation group and sector of activity. This effect is somewhat greater in the case of women (0.086 vs. 0.049), which indicates that being in a labour segment with a larger association factor generates a higher return in the women's group than in the men's group.
Table 2 shows the detailed output of the OB decomposition (Eq. (1)). The estimation shows a differential of 0.13 logarithmic points in favour of men (men earn, on average, 13.8% {= exp(0.13) − 1} more than women), of which 0.03 points are due to the endowment differences between men and women in the different covariates considered in the model, and 0.122 points are explained by the different returns that both genders obtain from those covariates. This result is, to a certain extent, in line with the literature on Spain in this field. Note also that the effect of the interactions is −0.022 (favourable to women).
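A minimal sketch of the threefold decomposition may help to fix ideas. It assumes that Eq. (1) takes the usual form with women as the reference group (endowment differences valued at female coefficients, coefficient differences weighted by female means), which is what the interpretation of the results suggests. The means and coefficients below are invented toy values, not estimates from Table 2; only the last line uses the reported components.

```python
import math


def threefold_ob(mean_m, mean_f, beta_m, beta_f):
    """Threefold Oaxaca-Blinder decomposition from the women's viewpoint.

    Returns (endowments, coefficients, interaction); their sum equals the
    difference in predicted mean log wages between men and women.
    """
    endowments = sum(bf * (xm - xf)
                     for xm, xf, bf in zip(mean_m, mean_f, beta_f))
    coefficients = sum(xf * (bm - bf)
                       for xf, bm, bf in zip(mean_f, beta_m, beta_f))
    interaction = sum((xm - xf) * (bm - bf)
                      for xm, xf, bm, bf in zip(mean_m, mean_f, beta_m, beta_f))
    return endowments, coefficients, interaction


# Toy example: an intercept (mean 1.0) and one covariate in logs.
mean_m, mean_f = [1.0, 5.6], [1.0, 5.4]
beta_m, beta_f = [7.0, 0.50], [7.2, 0.45]

e, c, i = threefold_ob(mean_m, mean_f, beta_m, beta_f)
gap = (sum(x * b for x, b in zip(mean_m, beta_m))
       - sum(x * b for x, b in zip(mean_f, beta_f)))
assert math.isclose(e + c + i, gap, abs_tol=1e-9)

# The components reported in the text add up to the overall gap in the same way.
reported_gap = 0.03 + 0.122 - 0.022
print(round(reported_gap, 3))
```

Note that in the toy example the intercept enters only the coefficients component, since its "mean" is one for both groups.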
Next, we discuss the detailed results for the individual predictors. In the unexplained part of the model (coefficient component), it can be observed that men accumulate (on average) a higher duration in the same company (in the year 2021) and obtain a higher return for that duration (see Table 2). This means that, given the female mean of the days worked in the company in 2021 (which enters the model in logarithmic format), the expected female wage would be 9.8% {= exp(0.09331) − 1} higher if the return of that duration for women were the same as that of men (coefficient effect or unexplained part of the GWG). Similarly, the increase would be 8.8% {= exp(0.08416) − 1} in the case of the variable worker's age (in logs). However, this positive effect is not observed in the association factor variable: women would earn 0.35% less if this variable were remunerated as in the case of men. Note that the value of the interaction is negligible for the duration in the company and the worker's age. In the coefficients component, we also find dummy categories for which the average wage of women would increase if the men's coefficients [4] were applied to the women's characteristics, for instance: the health sector (1.1%), the permanent (2.75%) or temporary (1.94%) contract and the public sector category (1.02%). Finally, for the model intercept (−15.9% = exp(−0.17299) − 1), the average wage of women would decrease if the men's constant were applied to women; in other words, when it comes to the constant of the model, being a woman contributes to reducing the wage gap. The case of the model constant is interesting since it includes the effects of unobservable variables not taken into account (i.e. not included in the model).
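The percentages quoted for the coefficients component are obtained by converting each log-point contribution with exp(·) − 1. The short sketch below reproduces that arithmetic for the three contributions whose log-point values are given in the text; the intercept contribution is entered as −0.17299, the sign implied by the reported −15.9%. The function name is hypothetical.

```python
import math


def log_points_to_percent(contribution):
    """Convert a contribution in log points into a percentage wage change."""
    return 100.0 * (math.exp(contribution) - 1.0)


contributions = {
    "duration in the company (logs)": 0.09331,
    "worker's age (logs)": 0.08416,
    "model intercept": -0.17299,
}
for label, value in contributions.items():
    print(f"{label}: {log_points_to_percent(value):+.1f}%")
```

For small contributions the log-point value and the percentage are close, which is why the two are often used interchangeably; the difference becomes visible for larger values such as the intercept.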
As for the explained part of the model (endowments component), we observe that, given the female return of the days worked in the company (in 2021), the expected female wage would be 5.9% {= exp(0.05764) − 1} higher if the average level of that duration for women were the same as that of men. Likewise, a small but significant positive contribution to the wage gap is observed in the association factor variable (0.21%). Moreover, for some dummy categories, the average wage of women would increase if they had the same average characteristics as men; some of these categories are the occupation group of administrative assistants (salary 2.2% higher) and the activity sectors of Health (1.7%) and Construction (1.65%). For example, in the health sector, women have a greater representation than men (men 4.02% vs women 15.8%), but belonging to this sector penalises them compared to other sectors; therefore, the component $\hat{\beta}_{\text{health},f}\,(\bar{X}_{\text{health},m} - \bar{X}_{\text{health},f})$ turns out to be positive. Note that we are simulating that women lose representation ($\bar{X}_{\text{health},m} - \bar{X}_{\text{health},f} < 0$) in a sector that penalises them in wages ($\hat{\beta}_{\text{health},f} < 0$). On the contrary, some dummy categories for which the average wage of women would decrease if they had the same average characteristics as men are the public sector category (wage 1.76% lower) and the occupation groups of technical engineers and graduate assistants (wage 1.1% lower) and of 1st and 2nd officers (wage 1.9% lower). In the case of the public sector category and the first occupation group, this is because women are more represented than men and the return of the corresponding category is positive for them (women). In the second occupation group (1st and 2nd officers), it is because men are more represented than women, but women obtain a negative return for belonging to this group. Consequently, in these three categories (public sector and the two occupation groups pointed out), being a woman contributes to reducing the GWG.
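The sign argument for the health sector can be written out with the shares reported above. The value of the female coefficient is not given in the text, so only its sign is used here:

```latex
\[
\underbrace{\hat{\beta}_{\text{health},f}}_{<\,0}
\;\cdot\;
\underbrace{\bigl(\bar{X}_{\text{health},m}-\bar{X}_{\text{health},f}\bigr)}_{0.0402\,-\,0.158\;=\;-0.1178\;<\;0}
\;>\;0 .
\]
```

The same reasoning, with the signs reversed as appropriate, applies to the three categories that reduce the GWG: a positive female return combined with a higher female share, or a negative female return combined with a higher male share, both yield a negative endowment contribution.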

Results by biclusters
(1) Job placement database
In this section, the Spanish labour market is analysed considering that it can be endogenously divided into labour segments (biclusters) that are based on how workers, classified according to their gender and age group, match up with jobs, classified according to their occupation group and activity sector. Using the methodology described in Section 2 on the database of 1,265,406 job placements, we obtain the association factors table (or heatmap) of Figure 4.
The columns of the figure represent 16 crossed categories of workers (worker segments) that come from combining 2 genders and 8 age groups {25 years or less, 26-30 years, 31-35, 36-40, 41-45, 46-50, 51-55 and 56 years or more}, while the rows represent 100 crossed categories of jobs (job segments) that come from combining 10 occupation groups and 10 activity sectors; the categories of occupations and activities are shown in Table 3. Both rows and columns have been, respectively, clustered to have an orderly view of the labour market. In Figure 4, we only show the column dendrogram, since the row dendrogram is too large (100 job segments). In addition, to better interpret the heatmap, (1) we have coloured blue the cells with an association factor greater than one, and (2) the higher the association factor of a cell with $a_{ij} > 1$, the darker its shade of blue.
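As a minimal sketch of how such a table of association factors can be built from a contingency table, the code below assumes the usual observed-over-expected definition, $a_{ij} = n_{ij} N / (n_{i\cdot} n_{\cdot j})$; the exact definition used in Section 2 is not restated in this section, so this formula is an assumption. The 2 × 2 counts are toy values.

```python
def association_factors(table):
    """Return a_ij = n_ij * N / (n_i. * n_.j) for a contingency table.

    `table` is a list of rows of counts (rows: job segments, columns:
    worker segments). Values above one indicate that the row and column
    categories appear together more often than under independence.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return [
        [n * total / (r * c) if r and c else 0.0
         for n, c in zip(row, col_totals)]
        for row, r in zip(table, row_totals)
    ]


# Toy contingency table of job placements (2 job segments x 2 worker segments).
table = [
    [30, 10],
    [10, 50],
]
a = association_factors(table)

# Cells that would be coloured blue in the heatmap (a_ij > 1).
blue = [[x > 1 for x in row] for row in a]
print(a)
print(blue)
```

In the paper's setting the table would have 100 rows and 16 columns, and the rows and columns would then be reordered by hierarchical clustering before plotting.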

IJM 45,10
Figure 4 allows us to reach conclusions of interest. The labour market to which women tend to go is visibly different from that of men. The column dendrogram (by gender and age group) shows that men and women are differently matched to different occupation and activity groups; note that this dendrogram groups first by gender and then, within each gender, by age group. Moreover, the groups of men and women of 25 years or younger have little similarity with the rest of the groups and show a relatively high dissimilarity between them. Therefore, we can deduce that there are significant gender differences in how young workers approach the labour market. Another difference between men and women is observed in the age group between 36 and 40 years. This group is initially arranged with the group of 31-35 years in the case of men and with the cluster of 41-50 in the case of women, giving the impression that age stigmatises women earlier than men.
Table 3 contains information about the rows (occupations and industries) of the heatmap from a gender perspective. In the table, we show the distribution of job placements (by gender and occupation group, and by gender and activity sector) of those placements that belong to the cells of the heatmap corresponding to the highest $a_{ij}$ quartile (Q1) for both men and women, i.e. the cells in Figure 4 with a more intense blue colour. In both cases (men and women), the 75th percentile of the $a_{ij}$ distribution, which marks the start of this top quartile, is reached approximately when the factor exceeds the value of 1.25. For comparative purposes, we also show in the table the job placement distributions for the entire sample.
Regarding the sectors of activity, the Q1-zone of the heatmap is very different for men and women. While women tend to be associated with service activities (especially health), men are associated mainly with extractive, manufacturing and construction industries and with services more typical of the private sector of the economy such as trade, hospitality, transport, communications and others. The women's job placements in health, education and public administration account for 33.7% of the job placements in the Q1-zone of the heatmap, this percentage being 15.7% if we analyse the entire heatmap. Interestingly, the finance and business service sectors are not a major focus of job attraction for men and women; in the matching game, they tend to prefer other sectors of activity. As for the occupation groups, the differences by gender in the Q1-zone of the heatmap are also notable. The women's job placements in the first four occupational groups (those with the highest qualification) exceed 20% in the Q1-zone (this percentage is 15.4% in the full heatmap), while the men's job placements do not reach 5% in these four groups (this percentage is approximately 13% for men in the full heatmap). Also noteworthy is the high relative weight of 1st, 2nd and 3rd officers in the case of men; in the darkest areas of the heatmap, male officers represent more than 30% of the job placements (18.4% if we look at the entire heatmap).
(2) Employee-Employer database
The previous heatmap (Figure 4) allows us to observe the existence of labour market segments (biclusters) that can be analysed from the perspective of the GWG. To do this, we need to use the EE database, because it is the one that allows us to use the annual earnings from each EE relationship. As can be seen, the heatmap in Figure 4 has been divided into 7 column clusters and 12 row clusters. This makes 84 cluster intersections, of which we have selected the 21 that show a higher degree of association between their respective rows and columns. We refer to these 21 cluster crossings as biclusters or labour market segments, which can be analysed using the OB decomposition. Figure 5 relates, for each bicluster, the GWG and the explained and unexplained components (in percentage) of the OB estimate. Figure 5(a) shows that a larger wage gap is observed in those biclusters with a greater relative weight of the explained part of the gap; the opposite happens with the unexplained part of the model (Figure 5(b)). These results would indicate that in those biclusters where wage differences are more important, these differences are mainly due to the characteristics of men and women and not so much to the different return of these characteristics.
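The selection of biclusters can be sketched as follows, under two assumptions that this section does not spell out: that the degree of association of a cluster crossing is measured by the association factor of a compact table of aggregated job placements, and that this factor is the observed-over-expected ratio. Cluster labels, counts and function names are hypothetical; in the paper the compact table is 12 × 7 and the 21 highest-ranked crossings are retained.

```python
def aggregate_by_cluster(table, row_cluster, col_cluster, n_rows, n_cols):
    """Sum the cells of `table` into an n_rows x n_cols compact table.

    row_cluster[i] / col_cluster[j] give the cluster index of row i / column j.
    """
    compact = [[0] * n_cols for _ in range(n_rows)]
    for i, row in enumerate(table):
        for j, n in enumerate(row):
            compact[row_cluster[i]][col_cluster[j]] += n
    return compact


def association_factors(table):
    """Observed-over-expected ratio for each cell of a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return [[n * total / (r * c) if r and c else 0.0
             for n, c in zip(row, col_totals)]
            for row, r in zip(table, row_totals)]


def top_biclusters(compact, k):
    """Return the k cluster crossings with the highest association factor."""
    a = association_factors(compact)
    cells = [(a[r][c], r, c) for r in range(len(a)) for c in range(len(a[0]))]
    cells.sort(reverse=True)
    return cells[:k]


# Toy heatmap: 4 job segments x 4 worker segments, two clusters on each axis.
table = [
    [20, 10, 2, 3],
    [15, 15, 5, 0],
    [1, 4, 30, 20],
    [4, 1, 10, 40],
]
compact = aggregate_by_cluster(table, [0, 0, 1, 1], [0, 0, 1, 1], 2, 2)
selected = top_biclusters(compact, k=2)
print(compact)
print([(r, c) for _, r, c in selected])
```

Each selected crossing then defines a subsample of EE relationships on which the OB decomposition can be estimated separately.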
Our microdata allows an in-depth analysis of any bicluster of interest. As an example, we describe in Table 4 the three biclusters that appear labelled in Figures 5 and 6. These are biclusters with a significant volume of job placements: one where women show a relatively high $a_{ij}$ factor, another where this happens to men and a third one where this happens to both genders.
The bicluster where women show a relatively high association factor (average women $a_{ij}$ = 1.48, average men $a_{ij}$ = 0.97) has 20,658 women's job placements and 14,009 men's job placements; this bicluster is labelled as "Women show higher $a_{ij}$" in Figures 5 and 6. We are talking about workers between 26 and 35 years old, administrative assistants, technical engineers or graduate assistants in the private service sector (commerce, hospitality, transport, communications and other services). In this bicluster, the wage gap is 0.18 logarithmic points (in favour of men) and is explained in almost equal parts by endowments and coefficients. For its part, the bicluster where men show a relatively high association factor (labelled as "Men show higher $a_{ij}$" in Figures 5 and 6; average women $a_{ij}$ = 0.8, average men $a_{ij}$ = 1.33) is made up of workers aged 41 or older and has 32,110 job placements for women and 71,070 for men. This bicluster covers private service activities (like the previous one), agriculture and extractive and manufacturing industries (it is quite transversal) and corresponds to officers or low-skilled workers. In this bicluster, the wage gap is 0.43 logarithmic points (in favour of men) and is mainly due to the characteristics of each one, although the unexplained part of the gap is also important. Finally, the bicluster where men and women show high association factors (label "Both show high $a_{ij}$" in Figures 5 and 6; average women $a_{ij}$ = 1.42, average men $a_{ij}$ = 1.48) is made up of workers aged 25 or less (31,682 job placements for women and 44,577 for men) with low qualifications (over 18 years unqualified or under 18 years, or 3rd officers and specialists) in the private service sector. In this bicluster, the wage gap is 0.26 logarithmic points (in favour of men) and is mainly due to the characteristics/endowments of each one.
Obviously, we cannot think of applying common labour policies to labour market segments with such different characteristics, wage gaps and gap decompositions.For example, while in the bicluster where men and women both show a high association factor (bicluster of young workers), it would be necessary to investigate why the endowments are so favourable to men in terms of remuneration; in the other two biclusters, it would also be necessary to investigate what factors explain that women obtain lower returns than men for their contribution to the productive activity.
Our segmentation analysis ends by relating the average GWG of each bicluster (wage difference in logs) to the ratio of the respective average association factors of women and men in

Conclusions
In this study, we have tried to show that the process of labour matching and the possible existence of gender wage discrimination in the Spanish labour market are phenomena that cannot be studied by treating the whole population/sample (employment episodes) homogeneously. On the contrary, it is necessary to segment the employment to find idiosyncratic labour markets that can be analysed from a gender perspective. Using matched EE data for the year 2021 in Spain (we use the MCVL provided by the Spanish Social Security), we take four variables that are key to segmenting the labour matching process: the gender and age group of the worker and the occupation group and activity sector of the job placement. By applying CT and hierarchical clustering techniques, we create a heatmap (clustered association table) of employment episodes in the year 2021 that allows us to identify employment biclusters (idiosyncratic labour markets) where workers of one gender (men or women) show a higher degree of association than workers of the other gender with certain sectors of activity and occupation groups and vice versa; the analysis is further refined by discriminating workers by different age groups.
Our study provides an in-depth analysis of workers' remuneration in the Spanish economy. We estimate a wage equation for the full sample and for men and women separately and perform the OB decomposition to try to explain the existing wage differential in favour of men. The OB estimation shows a wage differential of 0.13 logarithmic points in favour of men, most of it due to the different returns that both genders obtain from their respective matching-related attributes; this result is in line with the literature on Spain in this field. Additionally, we analyse the GWG in different idiosyncratic markets, which are defined by different clusters of occupation groups and activity sectors where certain clusters of men and/or women (of different age groups) tend to seek employment and get a job. These crossings of clusters with high internal association (denoted as biclusters) are extracted from the mentioned employment heatmap. The application of the OB model to these idiosyncratic markets based on "who matches with whom" constitutes a novel aspect within the literature on gender wage discrimination.
Our gender labour segmentation analysis shows that the labour market to which women tend to go is visibly different from that of men. For instance, while women tend to be associated with service activities (especially public services, such as health or education), men are mainly associated with extractive, manufacturing and construction industries and with private services. Furthermore, it is observed that women are associated more intensely than men with the highest occupation groups (those with the highest qualification). This segmented scenario implies that the phenomenon of gender wage discrimination cannot be analysed with a global and homogeneous vision of the labour market and addressed with general policies. Indeed, the different idiosyncratic markets extracted from our employment heatmap show different wage gaps, as well as different weights of the observed (endowments) and unobserved (coefficients) heterogeneity. The overall analysis of these labour market segments produces two interesting conclusions. On the one hand, women tend to be placed in those labour biclusters where the wage differentials with men are smaller. For women, not only is the wage level important, but also the situation of wage inequality with respect to male workers. On the other hand, in those biclusters where wage differences are more important, these differences are mainly due to the characteristics of men and women, and not so much to the different return of these characteristics, although this last component is not negligible. In fact, when we use data from the entire labour market, which would imply using information from the entire heatmap (and not just from the idiosyncratic biclusters), the unexplained component of the OB decomposition explains most of the wage gap.
Our combination of methodologies (clustered contingency tables and wage gap decomposition) is versatile and flexible (it can be applied to other economies or worker groups: migrants/natives, public/private employees, university graduates/other educational levels, etc.) and provides a better understanding of the underlying segmentation in the labour matching process and the effect of this segmentation on the gender wage issue. A more comprehensive knowledge of the underlying structure of the labour market helps in the efficient design of labour policies from a gender perspective; for instance, it would be necessary to act in those idiosyncratic labour markets where men continue to earn more than women fundamentally for reasons that are not justified by their respective characteristics. Our methodology allows us to identify these markets. The quantile analysis of the wage differential by labour segments, the consideration of the regional dimension in the heatmap and the application of our methodological tools to specific groups in the labour market are other possible lines of extension of our research.
Notes
1. The concept of association factor was introduced by Good (1956).
2. This last sector includes activities such as professional, scientific and technical activities, administrative and support service activities, recreational, cultural and sports activities, real estate activities, computer activities, associative activities and activities of households as employers or producers (self-consumption).
3. We have also estimated the models considering the two longest duration categories of each jobdependent variable, but we have discarded these estimates because the improvement in model fit is small and more degrees of freedom are lost in the estimation.
4. Observe in Table 2 that the OB model transforms the coefficients of the dummy variables so that they reflect deviations from the "grand mean" (in other words, the modified coefficients will sum up to zero over all categories) rather than deviations from the reference category. This deviation contrast transformation allows the model output to be invariant to the choice of the (omitted) base category. On this transformation, see for example Yun (2005).
5. To obtain the average association factor for each bicluster, we have created an auxiliary 7 × 12 CT where each cell represents the job placements for the crossing of the corresponding row cluster and column cluster. Our methodology of associations can be applied to this more compact CT to obtain the association factor between each cluster crossing.
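The deviation contrast transformation described in note 4 can be illustrated with a minimal sketch, assuming the common formulation in which the mean of the dummy coefficients (with zero for the omitted category) is subtracted from each coefficient and added to the intercept. The coefficients are toy values and the function name is hypothetical.

```python
import math


def deviation_contrast(intercept, dummy_coefs):
    """Re-express dummy coefficients as deviations from their grand mean.

    `dummy_coefs` maps every category (including the omitted one, with
    coefficient 0.0) to its coefficient. The transformed coefficients sum
    to zero and the intercept absorbs the mean.
    """
    mean = sum(dummy_coefs.values()) / len(dummy_coefs)
    normalized = {cat: coef - mean for cat, coef in dummy_coefs.items()}
    return intercept + mean, normalized


# The same toy model written with base category A and with base category B.
base_a = deviation_contrast(2.0, {"A": 0.0, "B": 0.3, "C": -0.1})
base_b = deviation_contrast(2.3, {"A": -0.3, "B": 0.0, "C": -0.4})

# Both parameterisations give the same transformed intercept and coefficients.
assert math.isclose(base_a[0], base_b[0])
assert all(math.isclose(base_a[1][k], base_b[1][k]) for k in "ABC")
print(base_a)
```

This invariance is what makes the detailed contributions of individual dummy categories in the decomposition comparable regardless of which category was omitted in estimation.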
Figure 3. Wage distribution by gender (logarithms)
Figure 4. Heatmap of the Spanish labour market
Figure 5. GWG gap and OB components by biclusters

Table 2.
Standard errors adjusted for 640,214 clusters of individuals (workers). Other control variables: Worker attributes (Nationality, Social benefits received in 2021, Income received in 2021 from professional activities, Number of labour contracts with the company during 2021, Duration since the first contract with the company (years), Collective agreement); Job placement attributes (Province, Percentage of the income from work that is in kind); Firm attributes (Legal person vs. natural person, Number of workers (logs)). The complete table is offered as supplementary material (available online).
Source(s): Authors' own work based on MCVL