Should different countries participating in PISA interpret socioeconomic background in the same way? A measurement invariance approach

It has been claimed that the socio-economic background scales used in International Large-Scale Assessments (ILSAs) lack theory-driven constructs and cross-country comparability. To address these issues, a new socio-economic background scale was first created based on Pierre Bourdieu's cultural reproduction theory, which distinguishes economic, cultural and social capital. Second, the measurement invariance of this construct was tested across countries participating in the Programme for International Student Assessment (PISA). After dividing the countries that participated in PISA 2015 into three groups, i.e., Latin American, European, and Asian, a Multi-Group Confirmatory Factor Analysis was carried out to examine the measurement invariance of this new socio-economic scale. The results revealed that the scale was not fully invariant in the analysis involving all countries. However, when more homogeneous groups were analysed, measurement invariance was verified at the metric level, except for the group of Latin American countries. Implications for policymakers and recommendations for future studies are discussed.


Introduction
International Large-Scale Assessments (ILSAs) have been given much attention due to the ever-increasing participation rate across countries in the world (Addey, Sellar, Steiner-Khamsi, Lingard & Verger, 2017). Retrospectively speaking, the International Association for the Evaluation of Educational Achievement (IEA) carried out the first ILSA in 1960, with the participation of twelve pilot countries (Addey & Sellar, 2018). By the end of the 1990s, the number of participating countries was approximately 40 (Tijana & Anna, 2015). Nowadays, nearly 70% of countries across the world participate in these evaluations (Lietz, Cresswell, Rust & Adams, 2017). Table 1 shows a selection of recent ILSAs and the respective number of participating countries for reference. The OECD's Programme for International Student Assessment (PISA) shows the highest number of participating countries, compared to other ILSAs. Participation in PISA has also significantly increased over time. In 2000, 43 countries participated in this assessment, whereas 72 took part in 2015, and 80 in the latest round which took place in 2018 (Steiner-Khamsi, 2019). Since 2000, the proportion of countries participating in PISA has almost doubled worldwide.
ILSA studies have two main objectives: to contribute comparative evidence on the functioning of educational systems, and to inform the development of educational and training programmes in participating countries from many diverse regions (Torney-Purta & Amadeo, 2013). In this context, ILSAs pose a major challenge relating to comparability, in that their underlying tools should enable sensible cross-country comparisons in order to meet these aims (Goldstein, 2017; Segeritz & Pant, 2013).

The design process of ILSAs must adhere to rigorous standards to make comparisons possible across a wide range of participants that differ in culture and in economic and political context (Miranda & Castillo, 2018). Results of these assessments should be comparable because, as Mullis (2002, p. 2) states, they "provide an opportunity to examine the impact on achievement of different educational approaches and additional insight into ones' educational system". To meet these requirements, measurement instruments should ensure that participants who hold the same level of a certain characteristic obtain the same score in the test.
It is in this context that measurement invariance becomes a key condition that needs to be verified in these studies. The design and implementation of measurement instruments should allow all countries participating in ILSAs to be represented equally. Measurement invariance is a prerequisite for group comparisons; that is, researchers can make comparisons between different cultures only if measurement invariance is ensured (Van de Vijver & Leung, 1997; Van de Vijver & Poortinga, 2002; Byrne & Van de Vijver, 2010). To put it differently, as long as a given scale's measurement invariance is confirmed among the relevant groups, scores obtained from it can be used to make comparisons across groups (Uysal & Arıkan, 2018). Conversely, if measurement invariance is not verified, both the validity of the scores and their interpretations and the fairness of the measurement process remain disputable (Gregorich, 2006). As a consequence, interpretations and conclusions about group differences across countries may not be valid (Cheung & Rensvold, 2002). The question of the cross-cultural comparability of cognitive assessments in ILSAs has received considerable attention in the literature (e.g., Wu, 2010; Klieme, 2016; Oliveri & Ercikan, 2011). Numerous studies have addressed the question of measurement invariance for PISA cognitive assessments, while less attention has been paid to PISA context questionnaires, i.e., student questionnaires (Van de Vijver, 2018; see, e.g., He et al., 2018). Although Hopfenbeck et al. (2018) explicitly state that measurement invariance is just as important for background questionnaires, Rutkowski and Rutkowski (2010) highlight that the comparability of background questionnaires across all participating countries has not been explored to the same extent as that of the cognitive assessments.
Despite this, background questionnaire responses from participants are still utilized to make approximate estimations of the population and subpopulation achievements by using linear regression models (Rutkowski & Rutkowski, 2010). In light of this, 'the degree to which a single measure of socioeconomic background is reliable and valid for all participating countries is not widely discussed' (Rutkowski & Rutkowski, 2013, p. 260).
Socioeconomic status (SES) is one of the most frequently used predictors of academic achievement in the literature (Sirin, 2005; White, 1982). The socioeconomic background of students has become increasingly essential in educational research to determine whether there is segregation, difference, or inequality between students in ILSAs, especially in PISA. In fact, this aspect is included in the fourth of the United Nations' Sustainable Development Goals (SDG4; UN, 2015), which aims to ensure inclusive and equitable quality education and promote opportunities for all students. The link between socio-cultural and economic status and academic achievement has been demonstrated repeatedly since the study of Coleman et al. (1966). SES is now regarded as an important indicator and has been integrated into studies of students' educational outcomes as a supplementary component (Bornstein & Bradley, 2003; White, 1982; Neff, 1938; Bradley & Corwyn, 2002; Sirin, 2005). For instance, the findings of Sirin's (2005) review highlighted that students' educational achievement is significantly affected by the socio-economic structure of families.
There are numerous studies addressing the association between SES and student academic achievement in the context of cross-cultural studies, particularly using PISA data (e.g., Park & Sandefur, 2016; Thein & Ong, 2015; Kalaycioglu, 2015; Pokropek et al., 2015). Nonetheless, there have been two fundamental criticisms of the use of SES in PISA, particularly when addressing questions of socio-economic inequality: the lack-of-theory issue and the problem of comparability. First, decisions about what will be included in ILSA studies are generally made without taking existing theories into account, and analyses tend to draw only on statistical measures, such as correlations and regression models (Lauder et al., 1998; Coe & Fitz-Gibbon, 1998). In that sense, there is a need to consolidate and understand the theoretical frame for socio-cultural and economic status as measured in ILSAs. Second, there is a fundamental debate as to whether SES has the same meaning across countries, particularly in terms of the indicators measuring this construct (Rutkowski & Rutkowski, 2013). This is a question about the validity of the interpretations made around SES and whether it can be measured across countries that have diverse contexts and conditions. Pokropek et al. (2017) gave an illustrative example on this point: "Having a car may not indicate socioeconomic status in the same way in the United States as it does in Japan. While in the United States car ownership is virtually universal (because distances between locations are large and the costs of maintaining a car low), in Japan car ownership is less common even in relatively wealthy families (as public transportation is widespread and efficient, and the cost of maintaining a car is high)" (p. 244).
To address the first of the abovementioned criticisms, this paper aims to obtain empirical support for Pierre Bourdieu's cultural reproduction theory as a theoretical basis for PISA's socio-economic status construct. This will be done by constructing, in accordance with this theory, a scale that does not originally exist in PISA. Cultural reproduction theory will be explained in detail in the literature review section.
Secondly, this paper aims to test the measurement invariance of the socioeconomic status construct across countries participating in PISA 2015 (OECD, 2018). The countries participating in PISA 2015 comprise a wide range of populations with different cultures, economic systems, and spoken languages. The measurement invariance of PISA's SES structure has been tested across all countries but has not been confirmed (e.g., Rutkowski & Rutkowski, 2013; Pokropek et al., 2017). To make cross-group comparisons more reasonable by forming more homogeneous groups, PISA 2015 participating countries were split into three groups according to region, i.e., Latin America, Asia, and Europe. In forming these geographical groups, countries with similar cultural, historical and macroeconomic backgrounds were treated as a single group.
In summary, this paper intends to give theoretical support to the socio-economic status construct in PISA and, consequently, to verify whether this scale shows measurement equivalence across PISA participating countries. This study therefore aims to establish whether the questionnaire designed to measure the socioeconomic background of students who participated in PISA 2015 has the same meaning across countries, particularly when they are grouped according to their region/continent. The results will provide valuable information for improving measures of concepts such as socioeconomic status and the methods currently used to analyse its association with educational outcomes. National and local governments, as well as international organisations in charge of implementing this kind of assessment, could be the main beneficiaries of the conclusions developed in this research.

Literature review
This section addresses, first, the current lack of theory supporting ILSAs' SES constructs, particularly in works that use PISA data (Caro & Cortes, 2012) and, second, the lack of evidence supporting cross-cultural comparisons of these constructs. Hence, the literature is organised around two main topics: (1) Pierre Bourdieu's cultural reproduction theory, and (2) cross-cultural research using SES indicators in ILSAs. Cultural reproduction theory is discussed because a connection will be established between this theory and our proposed SES construct. Cross-cultural research using ILSA data is reviewed in order to show the lack of empirical evidence to sustain the validity of comparisons of SES constructs across countries.

Cultural Reproduction Theory of Pierre Bourdieu
SES is described as a structure resulting from the combination of many components based on social, cultural, and economic factors such as individual's education level, household income, occupation, and home possessions. In the same way, the concept of capital pointed out by Bourdieu (1986) and Coleman (1988), expressed as three types of capital, i.e., economic, cultural and social, has been used in studies by most researchers to reveal the possible association between the family's socio-economic status and students' academic achievement. Capital is defined as a notion that "takes time to accumulate and which, as a potential capacity to produce profits and to reproduce itself in identical or expanded form, contains a tendency to persist in its being" (Bourdieu, 1986, p. 241).
Three forms of capital can be identified (Bourdieu 1986, p.242): economic capital, "which is immediately and directly convertible into money and might be institutionalized in the form of property rights"; cultural capital, "which is convertible, in certain conditions, into economic capital and might be institutionalized in the form of educational qualifications"; and social capital, "which is convertible, in certain conditions, into economic capital and might be institutionalized in the form of a title of nobility". Bourdieu (1986) highlights that economic capital is the root of the other types of capital. In other words, cultural and social capital are a result of the modification of economic capital. Family income might lead to resources that allow them to participate in after-school activities as well as to reach high-quality instructional facilities and to build linkage with others (Lareau, 2011). There are three forms of cultural capital (Bourdieu, 1997): incorporated or embodied cultural capital, objectified cultural capital and institutionalized cultural capital. Embodied cultural capital includes linguistic and cognitive competencies, cultural habits and tendencies. Objectified cultural capital contains possession and cultural goods, e.g., books, paintings. Institutionalized cultural capital comprises formal educational qualifications such as diplomas or certificates. It was revealed that cultural capital of students had significant effects on academic achievement (e.g. Yang, 2003;Barone, 2006). As DiMaggio (1982, p.190) points out: '[teachers] communicate more easily with students who participate in elite cultures, give them more attention and special assistance, and perceive them as more intelligent or gifted than students who lack cultural capital'. Social capital is expressed as belonging to a certain group based on the principle of recognizing and interacting with one another (Bourdieu, 1986). 
One reason for the differences in the educational level of students is the social capital produced as a result of the connections and interactions of the families at different levels (Rogosic & Baranovic, 2016).
(2016) considered social and cultural capital whereas Caro et al. (2014) considered all three types, including economic capital. Education studies have traditionally conceptualised social inequality as a multidimensional phenomenon (Abel, 2008); however, most studies do not address the complex structure of cultural, economic and social capital. At least in quantitative research, it is very rare to find studies in which an integrated SES structure is considered. To address this gap, in this paper we designed a model considering economic, cultural, and social capital, drawing on the PISA 2015 socioeconomic background questionnaires.
Measurement invariance analysis has been widely used over the last decade and continues to attract interest. In recent years, much attention has been paid to testing the measurement invariance of ILSAs' cognitive assessments. Wu, Li & Zumbo (2007) investigated the measurement invariance of the mathematics test across seven countries using TIMSS 1999 data and found that invariance was not supported. In the Italian context, Alivernini (2011) tested the measurement invariance of the PIRLS 2006 reading literacy scale across students' gender and immigration status, and the results showed that making such comparisons was not empirically supported.
Recently, studies have shifted their attention towards background questionnaires. For example, Segeritz and Pant (2013) examined the measurement invariance of the PISA 2003's Students' Approaches to Learning instrument across immigrant groups in Germany and did not achieve all levels of invariance. In Turkey, Demir (2017) explored the measurement invariance of students' affective characteristics across gender categories and found that this scale was largely comparable between gender groups.
There is a limited number of studies addressing the measurement invariance of the socio-economic status indicator. In the United Kingdom, Hobbs and Vignoles (2007) stated that Free School Meal (FSM) eligibility, which has been commonly used as a proxy for SES in UK educational research, does not have enough supporting evidence to allow comparisons across families with dissimilar characteristics. Using data for England from the Children of Immigrants Longitudinal Survey in Four European Countries (CILS4EU), Lenkeit et al. (2015) reveal that there are differences across immigrant groups in terms of the family background construct.
With regard to ILSA data, few studies relating to the measurement invariance of SES have been conducted. Hansson and Gustafsson (2013) found that invariance of SES was supported when comparing Swedish and non-Swedish populations, using TIMSS 2003 data. Rutkowski and Rutkowski (2013) found that the home possession indicator present in PISA 2009 SES index was not comparable across the 65 participant countries. Furthermore, Hernandez et al. (2019) explored the comparability of different socioeconomic scales of three ILSA studies: TERCE, PISA and TIMSS. None of the socioeconomic background scales was found to be fully invariant, which suggested that comparisons across countries should be made with caution.
Caro, Sandoval-Hernandez and Lüdtke (2014) highlight that, when using SES variables for making comparisons, recommendations or comments about participating countries, researchers should be extremely attentive and careful as comparisons are not fully supported by the evidence. Correspondingly, Hopfenbeck et al. (2018) emphasized in their systematic review that numerous articles suggest policymakers and researchers be careful and cautious when using PISA data as a valid benchmarking or informed policy-making tool.

Sample
PISA is a triennial survey first launched by the Organisation for Economic Co-operation and Development (OECD) in 2000. The PISA 2015 study was administered in 35 OECD and 37 non-OECD (partner) countries. PISA implements a two-stage stratified sampling strategy. In the first stage, schools are sampled with probability proportional to the number of students enrolled in the school. In the second stage, a sample of students is randomly selected within each school. Approximately 540,000 students took part in PISA 2015, representing about 29 million 15-year-olds in schools of the 72 participating countries (OECD, 2018). More detailed information on the sampling design, including weighting procedures, can be found in the PISA 2015 Technical Report (OECD, 2017). To explore cross-cultural comparability across countries, the current study considered 35 OECD countries and 19 partner countries (a total of 54 countries). The remaining partner countries were removed from the analysis because they lacked valid data for some variables.
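The two-stage logic described above can be illustrated with a minimal sketch. It is a simplification under stated assumptions, not the PISA procedure itself: it ignores stratification, exclusions and sampling weights, uses systematic selection as one common way to obtain probabilities proportional to size, and the function names, enrolment figures and per-school sample size are hypothetical.

```python
import random
from itertools import accumulate


def select_schools_pps(enrolments, n_schools, rng):
    """Stage 1: systematic selection with probability proportional to enrolment."""
    interval = sum(enrolments) / n_schools      # sampling interval on the size scale
    start = rng.random() * interval             # random start within the first interval
    cumulative = list(accumulate(enrolments))   # running total of enrolment
    selected, idx = [], 0
    for i in range(n_schools):
        point = start + i * interval
        # Advance to the school whose cumulative range contains this point.
        while idx < len(enrolments) - 1 and cumulative[idx] <= point:
            idx += 1
        selected.append(idx)
    return selected


def two_stage_sample(enrolments, n_schools, n_students, seed=0):
    """Stage 2: simple random sample of students within each selected school."""
    rng = random.Random(seed)
    sample = {}
    for school in select_schools_pps(enrolments, n_schools, rng):
        size = enrolments[school]
        sample[school] = rng.sample(range(size), min(n_students, size))
    return sample


# Hypothetical enrolment counts for six schools (not PISA data).
enrolments = [120, 80, 200, 50, 150, 100]
sample = two_stage_sample(enrolments, n_schools=3, n_students=30)
for school, students in sample.items():
    print(f"school {school}: {len(students)} students sampled")
```

In this sketch a school larger than the sampling interval could be selected more than once; the toy enrolments are chosen so that this does not happen.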

Measures
Nine subscales included in the PISA 2015's student questionnaire were selected to create a new SES scale based on Pierre Bourdieu's cultural reproduction theory. Indexes were used as indicators rather than each individual item, except for 'number of books' (ST013Q01TA). Table 2 indicates the items used to develop the new scale.  Table 3 shows the respective descriptive statistics. Items were grouped into three groups indicating whether they measure economic capital, cultural capital or social capital.

Analytical Strategy
The psychometric characteristics of the created scale were evaluated following a number of procedures. First, reliability (internal consistency) was evaluated using Cronbach's alpha coefficient (Cronbach, 1951). This coefficient ranges from 0 to 1, with values close to 1 indicating high levels of reliability. Second, a confirmatory factor analysis was implemented to evaluate the model fit for each country (see more information in the results section). We then applied a multi-group confirmatory factor analysis (MG-CFA) to examine the model fit and cross-cultural comparability of this scale across all education systems. Lastly, countries were split into three different sub-groups (Latin American countries, Asian countries, and European countries) to examine the cross-cultural comparability of this scale within more homogeneous groups.
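As an illustration of the reliability step, the following minimal sketch computes Cronbach's alpha from the standard formula, alpha = k/(k-1) * (1 - sum of item variances / variance of total scores). The function name and the toy scores are assumptions made for the example; they are not PISA data, and the study's own estimates may have been obtained with different software.

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha for k items.

    items: list of k lists, each holding one score per respondent
    (all lists must have the same length).
    """
    k = len(items)
    if k < 2:
        raise ValueError("Cronbach's alpha needs at least two items")
    totals = [sum(scores) for scores in zip(*items)]          # total score per respondent
    item_variance_sum = sum(variance(scores) for scores in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


# Toy data: three indicators scored for five respondents (hypothetical values).
toy_items = [
    [2, 4, 3, 5, 1],
    [3, 5, 3, 4, 2],
    [2, 5, 4, 5, 1],
]
alpha = cronbach_alpha(toy_items)
print(round(alpha, 3))
```

When all items rank respondents identically and on the same scale, the item variances sum to exactly 1/k of the total-score variance and the coefficient reaches 1.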

Confirmatory Factor Analysis
Models were estimated using maximum likelihood (ML). Model fit was tested using the Comparative Fit Index (CFI) and the Tucker-Lewis Index (TLI) as goodness-of-fit statistics, and the root mean squared error of approximation (RMSEA) and the standardized root mean squared residual (SRMR) as residual fit statistics. The closer the CFI and TLI values are to 1, and the closer the RMSEA and SRMR values are to 0, the better the model fit. Acceptable model fit was defined as CFI > .90, TLI > .90, RMSEA < .10, and SRMR < .08, as proposed by Hu and Bentler (1999) and Rutkowski and Svetina (2014).
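The cut-off rule can be written as a small check. The sketch below applies the four thresholds stated above to the pooled-model fit values reported in the Findings section (CFI = 0.893, TLI = 0.84, RMSEA = 0.094, SRMR = 0.059); the function and variable names are illustrative assumptions.

```python
# Cut-offs as stated in the text: CFI > .90, TLI > .90, RMSEA < .10, SRMR < .08.
CUTOFFS = {
    "CFI": (">", 0.90),
    "TLI": (">", 0.90),
    "RMSEA": ("<", 0.10),
    "SRMR": ("<", 0.08),
}


def check_fit(fit):
    """Return, for each index, whether the acceptable-fit criterion is met."""
    results = {}
    for name, (direction, cutoff) in CUTOFFS.items():
        value = fit[name]
        results[name] = value > cutoff if direction == ">" else value < cutoff
    return results


# Pooled model across all countries, values as reported in the Findings.
pooled = {"CFI": 0.893, "TLI": 0.84, "RMSEA": 0.094, "SRMR": 0.059}
flags = check_fit(pooled)
n_met = sum(flags.values())
print(flags, f"{n_met} of {len(flags)} criteria met")
```

Counting the criteria met in this way mirrors the country-level summary in the Findings, where countries satisfying three of the four fit measures are highlighted.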

Cross-cultural Comparability
MG-CFA is a method widely used to test measurement invariance (Widaman & Rice, 1997; Vandenberg & Lance, 2000; Hair et al., 2010; Kline, 2011; Milfont & Fischer, 2015). It is an extension of classic CFA based on multi-group comparison: it divides the data into groups and determines the model fit for each one of them (Kline, 2011; Bialosiewicz, Murphy & Berry, 2013). When MG-CFA is used to test measurement invariance, different levels of comparability must be explored, i.e., configural invariance, metric invariance, and scalar invariance (Kline, 2011; Vandenberg & Lance, 2000).
Configural invariance constitutes the first step when testing measurement invariance. It is associated with a model where the latent structure is equivalent across groups (Kline, 2011), i.e., the common factors and items measuring these factors are the same (Vandenberg & Lance, 2000). Although achieving this level of invariance does not mean that the groups are comparable, it is a prerequisite for testing other invariance levels (Kline, 2011).
Metric invariance implies that each group has equal factor loadings (Kline, 2011). If this level of invariance is verified, latent variances and covariances between latent variables can be compared (Millsap & Olivera-Aguilar, 2012). When the conditions for metric invariance are not met, items/indicators do not have the same meaning across groups (Gregorich, 2006). Scalar invariance is tested after metric invariance has been established. This level implies that item constants/intercepts are equivalent among groups (Millsap & Olivera-Aguilar, 2012) and that latent and observed variable means are comparable (Kline, 2011; Gregorich, 2006). In other words, if scalar invariance conditions are met, this will allow us to compare the level of the latent variable among different education systems.
Finally, strict invariance is the last level of invariance that can be tested and implies that residual covariances are equivalent across groups (Brown, 2015). However, this last step was not taken into account in this study as the scalar level was considered sufficient to make meaningful comparisons of latent factors across education systems (Meredith, 1993).
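The nested levels described above can be summarised in the usual multi-group factor model notation. This is a standard textbook formulation offered as a sketch; the symbols (group index g, intercepts tau, loadings Lambda, latent factors xi, residual covariance matrix Theta) are notational assumptions and do not appear in the original text. The block assumes the amsmath package.

```latex
\begin{align*}
\mathbf{x}^{(g)} &= \boldsymbol{\tau}^{(g)} + \boldsymbol{\Lambda}^{(g)}\,\boldsymbol{\xi}^{(g)} + \boldsymbol{\varepsilon}^{(g)},
  \qquad \operatorname{Cov}\bigl(\boldsymbol{\varepsilon}^{(g)}\bigr) = \boldsymbol{\Theta}^{(g)} \\[4pt]
\text{Configural:}\quad & \text{same pattern of free and fixed loadings in every } \boldsymbol{\Lambda}^{(g)} \\
\text{Metric:}\quad & \boldsymbol{\Lambda}^{(g)} = \boldsymbol{\Lambda} \ \text{ for all } g \\
\text{Scalar:}\quad & \boldsymbol{\Lambda}^{(g)} = \boldsymbol{\Lambda}, \ \ \boldsymbol{\tau}^{(g)} = \boldsymbol{\tau} \ \text{ for all } g \\
\text{Strict:}\quad & \boldsymbol{\Lambda}^{(g)} = \boldsymbol{\Lambda}, \ \ \boldsymbol{\tau}^{(g)} = \boldsymbol{\tau}, \ \ \boldsymbol{\Theta}^{(g)} = \boldsymbol{\Theta} \ \text{ for all } g
\end{align*}
```

Each level adds equality constraints to the previous one, which is why the models are compared sequentially through changes in fit.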
Two approaches to test measurement invariance are generally accepted in the literature: the chi-square (χ2) test and changes in CFI and RMSEA statistics (Byrne & Stewart, 2006;Cheung & Rensvold, 2002). In this study, Δχ2, ΔCFI, ΔRMSEA were calculated and assessed. Using the chi-square test to decide on the overall model fit is said not to be reasonable in this context due to the large sample sizes (Rutkowski and Svetina, 2014). Therefore, ΔCFI and ΔRMSEA values were assessed in order to determine metric and scalar invariance, drawing on the criteria suggested by Rutkowski and Svetina (2014) when analysing large and variable sample sizes and a large number of groups. To determine metric invariance, these authors provide a slightly more liberal criterion of around -0.020 for ΔCFI and 0.030 for ΔRMSEA.
To determine scalar invariance, the traditional cut-off values were taken into consideration, i.e., -0.010 for ΔCFI and a ΔRMSEA of 0.010.
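The decision rule can be sketched as a comparison of fit indices between two nested models. The example uses the CFI and RMSEA values quoted in the Findings for the Asian group; treating the thresholds as inclusive and rounding differences to three decimals are assumptions of this sketch, as are the function and variable names.

```python
def invariance_step(less_constrained, more_constrained, max_cfi_drop, max_rmsea_rise):
    """Compare two nested models and apply the delta-CFI / delta-RMSEA criteria.

    Differences are rounded to three decimals (an assumption) so that values
    reported to two or three decimals are not misjudged through floating-point error.
    """
    d_cfi = round(more_constrained["CFI"] - less_constrained["CFI"], 3)
    d_rmsea = round(more_constrained["RMSEA"] - less_constrained["RMSEA"], 3)
    supported = d_cfi >= -max_cfi_drop and d_rmsea <= max_rmsea_rise
    return d_cfi, d_rmsea, supported


# Fit values quoted in the text for the Asian group.
configural = {"CFI": 0.89, "RMSEA": 0.084}
metric = {"CFI": 0.87, "RMSEA": 0.083}
scalar = {"CFI": 0.66, "RMSEA": 0.124}

# Metric step: more liberal criteria (delta-CFI of -0.020, delta-RMSEA of 0.030).
metric_result = invariance_step(configural, metric, max_cfi_drop=0.020, max_rmsea_rise=0.030)
# Scalar step: traditional criteria (delta-CFI of -0.010, delta-RMSEA of 0.010).
scalar_result = invariance_step(metric, scalar, max_cfi_drop=0.010, max_rmsea_rise=0.010)

print("metric:", metric_result)
print("scalar:", scalar_result)
```

Under these assumptions the metric step is supported and the scalar step is not, which matches the conclusion reported for the Asian group.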

Findings
First, an overall reliability estimate and CFA results are provided, as well as country-level reliability estimates and CFA results for a model that consists of economic (ECN), cultural (CLT) and social capital (SCL). Next, measurement invariance analysis results are presented for the three abovementioned groups: Latin American countries, Asian countries, and European countries. Figure 1 shows overall CFA results and Table 4 shows country-level reliability estimates as well as country-level CFA models. The overall reliability was acceptable (Cronbach's alpha = 0.7). Factor loadings ranged from 0.31 to 0.87 and error variances from 0.25 to 0.90, as shown in Figure 1. Results indicate that this model including all countries shows a weak fit to the data (χ2 = 77538.351; DF = 24; CFI = 0.893; TLI = 0.84; RMSEA = 0.094; SRMR = 0.059). However, this model was retained for further analysis because it is theory-based. With regard to country-level results, reliability estimates ranged from 0.59 (The Netherlands) to 0.77 (BSJG China). In OECD countries, the average reliability estimate was 0.66 (ranging from 0.59 to 0.74), whereas in partner countries it was 0.69 (ranging from 0.60 to 0.77).
Country-level CFA models are shown in Table 4. As can be seen, no country met the minimum TLI cut-off value of 0.90. For this reason, countries that satisfy the minimum criteria in three of the four fit measures are shown in bold. A total of 19 nations reached the cut-off values for three fit measures, of which 12 were OECD countries and 7 were partner countries. Among these partner countries, there were three Asian countries (Chinese Taipei, Hong Kong and BSJG China), three Latin American countries (Colombia, Costa Rica and the Dominican Republic), and one European country (Russian Federation). All of these OECD countries were European, except for Korea.
Although there are education systems with relatively adequate fit both in OECD countries and in partner countries, there are still some education systems that do not show a good fit to the data. This evidence does not support cultural reproduction theory as an accurate model for some educational systems in this study. In particular, the model fitted poorly in Canada (CFI = 0.835 and TLI = 0.752) and in New Zealand (CFI = 0.826 and TLI = 0.739). It is important to point out that a well-fitted CFA model is essential before examining measurement invariance. Although not all education systems showed a good fit, invariance analyses were carried out because most of the education systems did.
Table 5 shows the baseline, configural, metric and scalar invariance models and their respective fit measures for the 54 education systems. As can be observed, the baseline model showed fit indices marginally within acceptable levels. When moving from the baseline model to the configural model, the fit indices showed little variation. Moving from the configural model to the metric model, the variation in fit indices was minor: the change in RMSEA was of an acceptable level, while the change in CFI was not within the expected value. Moving to the scalar invariance model, fit indices worsened notably and the changes in CFI and RMSEA were not acceptable. These results indicate that neither metric nor scalar levels of invariance were reached, and thus it is not possible to compare latent variances, covariances and means across all participating countries.
In the following stage, countries were grouped into three regions, namely, Latin America, Asia and Europe. Next, an MG-CFA was carried out within each group in order to explore whether measurement equivalence was supported. Table 6 shows results for Latin American countries. The baseline model showed satisfactory fit indices. Moving to the configural model, there was a slight improvement in terms of fit indices.
When moving from the configural to the metric level, the CFI value decreased from 0.89 to 0.86 and the RMSEA value increased from 0.092 to 0.095. These differences are just over the cut-off values proposed by Rutkowski and Svetina (2014).

Moving to the scalar model, the CFI changed from 0.86 to 0.78 and the RMSEA from 0.096 to 0.111. These results demonstrate that factor loadings and intercepts are not equivalent across Latin American countries, and thus no comparisons between latent variances, covariances and means can be made.
In Asian countries (see Table 7), the baseline model fit indices were CFI = 0.88, TLI = 0.82 and RMSEA = 0.096. Moving from the baseline model to the configural model, fit improved: the CFI rose from 0.88 to 0.89, the TLI rose from 0.82 to 0.84, and the RMSEA fell from 0.096 to 0.084. When moving from the configural to the metric model, the changes in CFI and in RMSEA were within acceptable levels. Using Rutkowski and Svetina's (2014) criteria, the results indicate that factor loadings are equivalent across Asian countries. When moving from the metric invariance model to the scalar invariance model, the CFI decreased from 0.87 to 0.66 and the RMSEA increased from 0.083 to 0.124, changes that exceed the cut-off values. Again, these results indicate that intercepts are not equivalent across Asian countries, and thus no comparison between latent means can be made.
In European countries (see Table 8), the baseline model's fit indices were acceptable (CFI = 0.90, TLI = 0.86 and RMSEA = 0.080). Moving from the baseline model to the configural model, fit indices marginally worsened (CFI from 0.90 to 0.89 and TLI from 0.86 to 0.84). Similarly, when moving from the configural model to the metric model, no considerable change in fit indices was observed: the CFI decreased from 0.89 to 0.87 and the RMSEA remained unchanged. These changes do not exceed the values proposed by Rutkowski and Svetina (2014), which suggests that factor loadings are equivalent across countries. Model fit deteriorated sharply when moving from the metric model to the scalar model, as the changes in CFI and RMSEA were not of an acceptable level.

Discussion
Identifying the differences in student academic achievement across countries is one of the main challenges facing education designers and practitioners, especially those dedicated to eliminating disparities among students across the world. Although the scale that measures socio-economic background explains these differences to a great extent, its adequacy for doing so remains open to discussion, as there is a wide variety of groups in PISA. Two main criticisms of studies based on secondary analyses of PISA data remain: the lack of a theoretical background for this scale from which to formulate the hypotheses they test, and the alleged lack of comparability of this construct. Therefore, theoretically supporting the underlying mechanisms of SES and making valid comparisons of this measure across countries are essential requirements.
The primary purpose of this paper was to address these criticisms by using items included in the PISA 2015 student background questionnaire to create an SES scale based on Bourdieu's reproduction theory (i.e. latent variables measuring students' economic, cultural and social capital) and to test the measurement invariance of these constructs across PISA participating countries. In other words, this study aimed to develop a reasonable theory-based structure from existing PISA indicators and to examine the comparability of this theory-supported structure across countries.
Regarding the first criticism, we have formalized a model that consists of economic, cultural and social capital, following Pierre Bourdieu's cultural reproduction theory. On the one hand, economic and cultural capital measures were selected based on this theory and a wide range of related previous studies. The WEALTH, PARED, and HISEI indicators for economic capital and the CULTPOSS, HEDRES and number of books indicators for cultural capital showed higher factor loadings, which is similar to Caro et al.'s (2014) findings. On the other hand, the COOPERATE, CPSVALUE and EMOSUPS indicators showed acceptable factor loadings for social capital. Social capital, however, is a multidimensional concept that cannot be easily measured using the available data.
Our results showed that a construct derived from PISA's indicators supported cross-cultural comparability across all countries only at the configural level. However, after countries were split into more homogeneous groups (Latin America, Asia, Europe), partial cross-cultural comparability was supported. Rutkowski and Rutkowski (2018) have pointed out that ILSAs include linguistically, geographically, economically, and culturally diverse participating countries. Therefore, they suggest that well-structured country-specific indicators should be produced rather than single indicators for all participating countries. This way, it would be possible for each participating country to incorporate its territorial conditions into comparable international scales (Rutkowski and Rutkowski, 2018; Sandoval-Hernandez et al., 2019).
The results of this study lead to recommendations that merit consideration. The analyses including all countries do not support comparisons across education systems when using this socio-economic status scale, as neither the metric nor the scalar level of invariance was reached. This may be partly related to regional and socio-cultural factors, as well as language, as stated by Lee (2019) in her work focusing on the home possessions scale. Although the Latin American group did not achieve the metric invariance level, results were close to acceptable values. In both Asian and European countries, the metric level of invariance was achieved but the scalar level was not. Sandoval-Hernandez et al. (2019) have highlighted that in TERCE, a much more regional assessment, the socio-economic background scale reached the metric level of invariance. Our suggestion is in line with Rutkowski and Rutkowski (2018), who state that ILSAs would benefit from "the active involvement of countries or regions to develop and include more country or regional options into the background questionnaire" (p. 365).

Limitations
This study has limitations that should not be ignored. Firstly, it is worth mentioning that the variable 'number of books' is categorical. In order to carry out the analysis in a way that takes into account the survey design, variables must be continuous. However, Liu et al. (2017) state that if there are more than five response categories in ordered-categorical data, it may be acceptable to analyse them as continuous data. Since this variable has more than five response categories, it is reasonable to assume that there was no significant bias in parameter estimation.
Another limitation is that, although social capital is an indicator of socioeconomic background, there are not enough items in PISA that capture and measure it. Therefore, this study encourages policymakers and educational research designers to take this into account and act accordingly. Moreover, social capital is an extensive and multidimensional notion that comprises different dimensions such as structural, cognitive and relational factors. We have mostly conceptualised social capital using variables relating to interpersonal relationships and parental responsibility in education. Since indicators of social capital are limited, we could not focus on all aspects of this construct.
It is worth noting that this study was carried out to determine whether an SES scale was invariant and did not focus on the reasons for non-invariance. In this context, if measurement non-invariance is detected at a given step, further analyses should be carried out to determine its sources before proceeding to the next stage.

Conclusion
This paper has supplied evidence that PISA indicators of socioeconomic background have serious psychometric deficiencies when used to elucidate differences in educational achievement across different educational systems. Given the diversity of participating countries, further investigation could be carried out on the comparability of other scales included in PISA's background questionnaires, such as teaching practices. Such studies are necessary because PISA's report provides information on such scales, and many researchers around the world use these variables to explain academic achievement. Evidence-based comparisons among countries are undoubtedly needed by educators in each country.
As revealed in this study, when countries are divided into groups according to region/continent, the comparability of some background scales across education systems can be supported by evidence. In that sense, two alternatives can be considered. On the one hand, ILSAs could use continent-specific or country-specific items for their background questionnaires. On the other hand, the process of developing background questionnaires could benefit from more heterogeneous groups of experts that represent different countries and languages. Adopting these suggestions would address the need to design assessments with a focus on specific regions.