Leveraging supplementary modalities in automated real estate valuation using comparative judgments and deep learning

Purpose – In this study the authors aim to outline new ways of information extraction for automated valuation models, which in turn would help to increase transparency in valuation procedures and thus contribute to more reliable statements about the value of real estate.
Design/methodology/approach – The authors hypothesize that empirical error in the interpretation and qualitative assessment of visual content can be minimized by collating the assessments of multiple individuals and through the use of repeated trials. Motivated by this problem, the authors developed an experimental approach for semi-automatic extraction of qualitative real estate metadata based on Comparative Judgments and Deep Learning. The authors evaluate the feasibility of their approach with the help of Hedonic Models.
Findings – The results show that the collated assessments of qualitative features of interior images have a notable effect on the price models and thus offer potential for further research within this paradigm.


Introduction
In addition to established and standardized real estate valuation methods, which in many jurisdictions are legally binding in certain contexts, Automated Valuation Models (AVMs) are increasingly being used in the profession. Their areas of application range from property and wealth taxation, to the verification of the value of real estate portfolios, to the calculation of insurance and credit risks (Brunauer et al., 2017). The premiums and discounts calculated by an AVM on the price of a piece of real estate in a given market are a function of a dynamic interaction of temporal and geographic factors, but also of intrinsic characteristics, both quantitative and qualitative, of the property itself. In consequence, the quality of an AVM depends on the amount of viable input data describing such characteristics. In general, standardized data is used in these models, which is coded into a readable format and usually represents general information about the property such as purchase price, year of construction, floor area, or the street address, from which in turn further data points (distance to schools, the city center, etc.) can be computed. To supplement this observable objective data, subjective expert assessments of both the property and the wider market are usually gathered, too. These could concern, for example, the condition or location quality of the given piece of real estate.
Given the complex composition of this input, and the large amount of representative data required for Machine Learning (ML)-based AVMs to be effective, data enrichment and data pre-processing represent two of the most important tasks in the entire analytical process chain of AVMs. First, the researcher has to make informed choices regarding the data and variables to draw on in designing their AVM. Needless to say, these will be variables that the analyst assumes will have a marginal effect (in statistical terms) on the price of a property. Hedonic-based AVMs, in particular, use pooled information derived from multiple sources such as real market transactions, offer data, or other proprietary and non-proprietary data sources (Glumac and Des Rosiers, 2020). Due to the continuous development in the field of Machine Learning and in order to maximize potential data sources, companies and data analysts increasingly extract information from additional modalities, such as visual or textual data (e.g. from real estate listings), so that researchers can integrate more qualitative data into their AVMs (Desai, 2019). In the case of condominiums and rental properties, interior photos are widely available as part of listings, which makes them an easily accessible and promising resource in this regard.

Problem statement
As part of the conventional real estate valuation and marketing process, the assessment of qualitative real estate characteristics, as recorded in visual or textual data (images and descriptions in real estate listings, for instance), is an important but demanding task that is largely based on individual expertise. What is more, the individual assessment of criteria such as location quality, interior design, floor plan design, or the general condition of the apartment is often influenced by subjective perception, personal preferences, or self-interest. Unfortunately, not all of these problems are instantly resolved by moving to ML-based automated valuation methods, as manual annotation of qualitative features is a prerequisite for successful Machine Learning. The influence of the annotator's subjective interpretation is usually not a problem when it comes to simple categorization, such as the detection and classification of objects within an image, e.g. the presence of a shower or a bathtub in a bathroom. This is why the annotation in such cases is usually performed by a single person. Subjectivity becomes much more of an issue, however, when the semantic significance of multiple objects within an entire scene needs to be established. The problem can be explained more clearly with the help of Figure 1. The assessment of the quality of the bathrooms shown can easily suffer from the vagaries of subjective perception, personal preferences, or a lack of expertise on the part of the annotating person. The individual qualitative assessment of new interior items, such as a new kitchen, can be even more complex. The new kitchens shown in the image range in price from EUR 700 to EUR 5,000, yet their difference in quality is difficult to assess based on visual data alone. Consequently, qualitative interpretations by a single annotator can lead to biases in ML-based assessment models. To the extent that ML-based systems are designed to improve on these conventional methods, the question remains how such a bias can be minimized or even eradicated.
To this end, we hypothesize that the empirical error in quality estimation in general, and in the interpretation of semantic cues in visual media in particular, can be minimized by combining and collating the assessments of multiple independent individuals. Furthermore, we assume that by employing repeated trials when respondents rate the images, the total error variance within the ratings of an individual, and thus the overall range of the collated ratings of the given visual content, can be reduced. Thus, the challenging task in this context is to develop a research design that allows repeated independent assessments by multiple individuals, has acceptable manual and computational effort, and provides reliable results. Motivated by this problem, we developed an experimental approach for semi-automatic extraction of qualitative real estate assessments (i.e. quality ratings of real estate interiors) based on comparative judgments of visual content and computer vision. We investigate our assumptions in a concrete case study by having respondents rate the quality of bathrooms using interior views of residential apartments, by analyzing the effect of their ratings on the rental price of a unit, and by automatically generating ratings learned from the human estimates. Thus, the present study has two main objectives:
(1) To present an approach that efficiently combines and collates the subjective ratings of visual features by multiple individuals and to show how such a method could be easily adapted to Machine Learning approaches in a broader sense.
(2) To evaluate the effect of image-based assessments of the quality of interior spaces on the rental price of apartments, and thus to check the applicability of the approach to real estate use cases.
More broadly, we aim to outline new ways of information extraction for automated valuation models, which in turn would help to increase transparency in valuation procedures and thus contribute to more reliable statements about the value of real estate.

Literature review
To achieve the objectives outlined in the previous section, we combine approaches from different disciplines and mix quantitative and qualitative paradigms. In the following section, we explain the theoretical background of the mixed methods used in this paper. The literature review is structured into three sub-sections that cover the research areas that contributed to our own methodological design: media content analysis, hedonic valuation models, and deep learning.

Media content analysis and comparative judgments
Measuring visual stimuli based on emotions and subjective judgment is a widely used methodology in many research areas, e.g. by Baveye et al. (2018) and Greenwald et al. (1989). Despite advances in ML and current software systems' ability to recognize lower-level visual content, humans are still much more capable when it comes to perceiving semantic cues at a high cognitive and affective level. The basic idea for the proposed qualitative assessment is based on work by Hoffmann et al. (2012) and Thurstone (1994). It has been widely established that an observer often makes different comparative judgments about the same pair of stimuli on successive occasions. In other words, the observer is inconsistent in his or her comparative judgments from one occasion to the next (Melinger and Schulte im Walde, 2005). According to Thurstone (1994), any such phenomenon is referred to as a fluctuating discriminative process. Following Robinson (2005), we assume that emotions may potentially influence the evaluative judgments of the participants in our own research design. The analysis of emotions that are triggered by audio-visual stimuli is the focus of Affective Content Analysis (ACA). In the respective literature, various approaches to emotion mapping have been proposed, though discrete and dimensional mapping are the predominant ones. In the current study, we deliberately rely on discrete emotion models. Within this framework, there are 22 possible types of discrete emotions that could be triggered when participants view the visual content, and they are usually expressed as binary scales, e.g. pleased/displeased, approve/disapprove, like/dislike, etc. (Hoffmann et al., 2012).
Experiments involving ACA have a long tradition in broadcasting research (ITU, 2012). As early as the 1970s, the first concepts were developed to incorporate an automatism into the assessment process, where in the beginning the inferences were achieved by aggregating test results from objective computer-based and subjective human-based experiments (Webster et al., 1993). Affective Media Content Analysis has wide applications to this day, ranging from Sentiment Modeling (Chen et al., 2014) to Human-Computer Interaction and Affective Computing (Bee et al., 2006; Hoffmann et al., 2012) and Image Retrieval (Zhao et al., 2018). A highly complementary method to ACA for more effectively eliciting human responses from interaction with visual media content is Alternative Forced Choice (AFC) (ASTM International, 2009), which is essentially based on the concept of multi-alternative perceptual decision (Ditterich, 2010). The method allows for a scene or an object from the scene to be tested, with the scene or object sharing a common conceptual category and properties but nevertheless differing visually. The forced choice is made, as the name implies, from multiple but pre-determined alternatives. In the context of ACA, AFC has been used, e.g. for emotional face recognition (Thomas et al., 2007) or to determine individual color preference (Yu et al., 2020).

Hedonic theory in the context of automated real estate valuation
Computer-aided valuation of properties can be traced back to the 1970s (Carbone and Longini, 1977) and was initially introduced under the term Automated Assessment System (Case, 1978). Today, an Automated Valuation System (AVS) is defined as data analysis software consisting of single or multiple AVMs and a user interface; these elements combined are used to establish a price estimate for an individual property or parcel of land through a structured decision-making process (Glumac and Des Rosiers, 2020). In its application, an AVM is essentially reliant on the data, the approach, and the method used. In terms of methods, the Hedonic Price Method (HPM) has dominated automated valuation for decades due to its versatility and flexibility. It is compatible with a wide variety of data and can be used along with different automated valuation approaches (e.g. probabilistic, non-probabilistic, market or income approaches). Hedonic theory holds that a good is composed of many characteristics, all of which can affect its value (Rosen, 1974). An HPM analyses the marginal effects of these characteristics on the price of the good. The use of HPMs as an empirical valuation method can be traced back to 1939, originally focusing on the estimation of hedonic price indexes for automobiles (Court, 1939; Goodman, 1998). Research in this area was revived in 1961 in the work of Griliches (1961). Over the past decades, countless theoretical and empirical studies on hedonic pricing in the real estate and housing market have been performed. Good reviews of the literature in this area can be found in the work of Herath and Maier (2010) and Malpezzi (2008), amongst others. For an overview of the underlying functioning of an HPM, see Section 4.3 (Hedonic Regression).

Deep learning for visual pattern recognition
In certain scientific problems, such as the one at hand, one is confronted with the application of different data modalities (e.g. tabular and visual data). Convolutional Neural Networks (ConvNets) are used in this study to extract intrinsic information from complex visual features. Like many artificial intelligence (AI) applications, the functionality of ConvNets is based on the theoretical foundations of deep learning, with a dominant focus on the theories of approximation and optimization, as well as on the paradigm of representation learning (Bengio et al., 2013). ConvNets enable deep learning of highly representative image features from training data in a layered hierarchical fashion. An effective technique that successfully uses ConvNets for image classification and regression is transfer learning, i.e. fine-tuning the ConvNet models while maintaining criteria of domain adaptation (Shin et al., 2016; Robinson, 2005).
In order to arrive at a holistic price valuation of real estate using various data modalities, it is necessary to automatically include features in the price formula, for which the use of deep learning algorithms is currently the most promising approach. So far, ConvNets have been mainly used for classification of visual data (Koch et al., 2020). A real estate related example is the article of Renigier-Biłozor et al. (2022), where human emotions generated by looking at real estate images are classified in order to incorporate the detected emotions into a valuation model. In the article of Glaeser et al. (2018), a ConvNet is used to evaluate the impact of the exterior and, in Poursaeed et al. (2018), of the interior visual appearance of a building on prices. The use of ConvNets for regression is not as widespread as for classification problems, but is increasingly finding application, for example in position recognition in buildings (Ballesta et al., 2021). The regression ConvNet methodology is also used for predicting stock prices via annual reports and text analysis (Dereli and Saraclar, 2019) and using historical data (Mehtab and Sen, 2020). Other applications include prediction of angles (Fischer et al., 2015), prediction of distances for 3D position estimates (Mahendran et al., 2017), or age estimation (Rothe et al., 2016). In the real estate field, Solovev and Pröllochs (2021) choose a pretrained ConvNet to predict apartment rent prices using pictures of the floor plans as the input. Shen et al. (2022) also use a regression ConvNet to predict rental prices in Wuhan neighborhoods based on the spatial density of points of interest. The authors find that a regression ConvNet outperforms other prediction methodologies. Regression-based ConvNets are of special importance for our approach, where we aim to predict subjective quality assessments from images.

Approach
The basis of our approach is the subjective estimation of the quality of bathrooms, where the obtained estimates are integrated into a hedonic model to evaluate their effect on the rental price. In a separate experimental stage, the same estimates are learned by a neural network to evaluate the feasibility of annotating new data. To better illustrate the proposed methods, we have summarized their theoretical foundation in this section. Figure 2 shows the methodological steps in the proposed order.

Elo rating
In any qualitative assessment that results in a classification, the main problem lies in the unavoidable fuzziness of the decision boundaries between individual classes. Consequently, the implementation of subjectively rated quality classes in the price prediction of a unit of real estate can, in certain scenarios, have a significant impact on the estimated sales or rental price. This is problematic, because it implies that detailed description lists and substantial experience on the part of the judging person are required to adequately assess the quality of a given real estate characteristic. To overcome this problem, we apply a system in which respondents are successively presented with pairs of randomly selected images from a pool of representative images and only have to decide which of the two images depicts the higher quality interior. By using such subjective serial pairwise comparisons of image features, we can assign images a metric score (and not a class!) that is based on simultaneous as well as successive contrasts as a result of repeated trials. In other words, respondents perform an Alternative Forced Choice between three alternatives (viz. win-lose-draw), whereby the responses are recorded by an automatic backend scoring system. The proposed approach is expandable and requires a sufficient number of respondents, representative images, as well as repeated trials in order to decrease the error variance continuously and thus to generate usable data.
To measure the quality of real estate interiors using loop-wise direct comparison of image pairs in a continually updated manner (assuming that estimated quality scores will vary sufficiently in the data distribution), we apply the Elo formula. The Elo rating system is used in practice to quantify the relative abilities of chess players (Elo, 1978). In an Elo rating, each player starts with an initial score and, depending on how he or she plays against players with a higher or lower Elo score, this score is updated according to Equations 1 and 3.
Figure 2. Overview of the methodological steps in the proposed order, starting with subjective assessment based on voting (left), computation of quality ratings, data partitioning and automatic generation of ratings (middle) and concluding with evaluation using hedonic models (right)

The following equation gives the expected Elo value of a win for the problem at hand (Tsang et al., 2016). If image A has rating R_A and image B has rating R_B, then the expected value of image A beating image B is given by

$$E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}} \qquad (1)$$

where R_A and R_B are initialized with 1500. Hence, note the logistic property:

$$E_A + E_B = 1 \qquad (2)$$

The parameter 400 controls the different probabilities of the possible outcomes in favor of either the higher or the lower rated image (Elo, 1978). Having an expected score for image A when voting against image B and three possible outcomes for A, namely win, lose, or draw, which correspond to values of 1, 0, and 0.5, respectively, we calculate the Elo score as follows:

$$R'_A = R_A + K\,(S_A - E_A) \qquad (3)$$

where S_A denotes the actual outcome for image A and R'_A its updated rating. The constant K (K = 10) is used to adjust the weighting sensitivity of the score update. It is assumed that the Elo algorithm is sufficiently robust to map the scores with appropriate proportionality, provided that player performance or player skill (in the present case, that of the participants) follows a normal distribution and remains constant over time, which is not always the case in reality (Glickman and Jones, 1999).
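The scoring rule can be illustrated with a minimal Python sketch. It implements the standard Elo expected score, 1 / (1 + 10^((R_B - R_A)/400)), and the update R_A + K (S_A - E_A), with the initial rating of 1500, the scale parameter 400, and K = 10 taken from the text. The function names and the encoding of the outcome as a float are assumptions made for illustration; this is not the authors' backend code.

```python
INITIAL_RATING = 1500.0  # starting score for every image (from the text)
SCALE = 400.0            # logistic scale parameter (from the text)
K = 10.0                 # update sensitivity (from the text)


def expected_score(r_a: float, r_b: float) -> float:
    """Expected score of image A against image B (standard Elo logistic form)."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / SCALE))


def update(r_a: float, r_b: float, outcome_a: float) -> tuple[float, float]:
    """Return the new ratings of A and B after one vote.

    outcome_a is 1.0 if A wins, 0.0 if A loses, and 0.5 for a draw.
    """
    e_a = expected_score(r_a, r_b)
    e_b = 1.0 - e_a  # logistic property: the two expected scores sum to one
    new_a = r_a + K * (outcome_a - e_a)
    new_b = r_b + K * ((1.0 - outcome_a) - e_b)
    return new_a, new_b


if __name__ == "__main__":
    # Two unrated images; A wins the first comparison.
    print(update(INITIAL_RATING, INITIAL_RATING, 1.0))  # (1505.0, 1495.0)
```

With K = 10, a single vote between equally rated images moves each score by at most 5 points, which is one reason why many votes and repeated trials are needed before the scores spread out.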

ConvNet regression
Elo scores can be obtained from pair-wise comparisons by human raters as described in Section 4.1. This is, however, time-intensive and thus expensive. We propose to estimate Elo scores automatically using a regression-based ConvNet. Thus, in this experimental phase, we aim to verify whether human-estimated scores can be learned by a ConvNet and generalized to new images unknown to the trained network. For training, we use the scalable EfficientNet network (Tan and Le, 2019). The network uses optimized constants for width, depth, and resolution, as well as a coefficient for the available computational resources, to allow adaptive model scaling for different input sizes and different numbers of floating point operations (FLOPS), yielding the EfficientNet models B0 to B7. The underlying structure of the base model EfficientNet-B0 consists of seven building blocks, each with inverted residual blocks (Sandler et al., 2018) and stem layers (see Figure 3). Stem layers act as a compression mechanism that leads to a rapid reduction in the spatial size of activations, reducing storage and computational costs. Residual blocks are inverted blocks with depthwise separable convolution (Chollet, 2017), which in turn significantly reduces the number of parameters. Furthermore, each residual block contains a squeeze and excitation sub-block that dynamically assigns high weights to more important channels, thus mapping channel dependence while providing access to global spatial information of the input signal.
If the model is going to be trained for a regression task, some modifications of the network are required.This includes adding a batch normalization layer, a dropout layer, and a regression output layer at the top, and changing the last dense layer to 1 neuron (see Figure 3).

Hedonic Regression
To evaluate the effect of estimated and predicted Elo scores on the real estate price, we use the Hedonic Price Method (HPM). HPM is also known as Hedonic Regression and is a commonly used method to predict real estate prices by estimating the marginal contribution of real estate characteristics to the price. In general, the hedonic price function has the form

$$P = f(Z)$$

where P is the price of the unit of real estate and f is a function of the vectorized values Z, which describe the characteristics of the property. The basic assumption in Hedonic Pricing is that the relevant determinants of the dependent variable (price or index) are known in advance. In practice, different variables are used depending on the research question, the preferences of the researchers, or the availability of data (Herath and Maier, 2010). Sirmans et al. (2009) summarize 470 possible variables for Hedonic Pricing that have been used in the scientific literature to date. If we divide real estate characteristics into three main subcategories, then the price function has the form

$$P_i = f(S_i, L_i, N_i)$$

where S_i is a vector of structural real estate characteristics, L_i is a vector of location variables, and N_i represents neighborhood characteristics. In the present study, we include the human estimates in the structural variables S and omit the neighborhood characteristics N. For the variable setup in our hedonic model, see Section 5.3.2.

Experimental setup
In line with the proposed methodological steps (see Figure 2 in Section 4), the following section outlines criteria for evaluating the performance of our model in relation to the research questions, provides a detailed description of the data, and lists individual experimental steps.

Research questions
Our objective is to answer the following research questions:
(1) RQ1: To what extent can qualitative characteristics of real estate interiors be derived from photographs by means of repeated comparative judgments of multiple individuals, and how significant is their effect on the rental price of apartments?
(2) RQ2: To what extent is a ConvNet regressor able to generate plausible quality judgments by learning these human estimates from associated images, and are these predicted judgments beneficial for real estate price estimation?
The research questions are discussed in Sections 6 and 7.

Data
For the human-based estimation of the bathroom quality, we use an initial pool D_i of 1,000 manually selected representative images of bathrooms. The images originate from real estate listings published in the year 2020 (Justimmo, 2021), which also include structural and location characteristics. Of these, 250 images correspond to instances in the test dataset P_1, which includes structural and location characteristics of 250 rental apartments (one image per apartment). The test data set P_1 is then used to derive the effect of human-estimated as well as of ConvNet-predicted bathroom quality scores on the apartment rental price (see Sections 5.3.3 and 5.3.4). The remaining 750 images from D_i were used for the training and validation of the ConvNet (see Section 5.3.3). For the human estimation, the images were not processed, i.e. they were used in their original shape and size.

Steps of the experiment
In the sub-sections below, the individual experimental steps are described in detail.
5.3.1 Elo rating: setup for human judgments. We customized a web browser application (Gerneth, 2014) based on the Python module Flask and SQLite. Using the application's user interface (see Figure 4), the participant votes on two displayed images according to the Three-Alternative Forced Choice (3AFC) paradigm, choosing between a win for either image or a draw for the displayed image pair. For each vote, a quality score is calculated in the backend based on the Elo rating (see Equations 1 and 3). For each voting round with a respondent, 500 pairs of images are randomly regenerated from the image pool D_i with 1,000 representative images. Note that not all possible combinations of image pairs can be tested in the procedure, since this would result in 499,500 possible combinations for 1,000 images (n(n-1)/2). The voting comprises 16 voting rounds with 8 participants and 2 repeated trials per participant, whereby each person could repeat the trial only after the remaining 7 participants had voted once. After each voting round, all estimated Elo scores are saved into a separate database table, finally resulting in a data set S_c with 1,000 scores, one for each image from D_i.
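A single voting round as described above can be sketched in Python as follows. The pool size, the number of pairs per round, and K come from the text; the sampling scheme (pairs of distinct images drawn independently), the `vote` callback standing in for the participant, and all function names are assumptions for illustration and do not reproduce the customized web application.

```python
import random
from math import comb

N_IMAGES = 1000        # size of the image pool (from the text)
PAIRS_PER_ROUND = 500  # image pairs shown per voting round (from the text)
K = 10.0               # Elo update constant (from the text)


def expected_score(r_a, r_b):
    """Standard Elo expected score of A against B."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))


def sample_pairs(n_images, n_pairs, rng):
    """Draw n_pairs random pairs of distinct image indices (assumed scheme)."""
    return [tuple(rng.sample(range(n_images), 2)) for _ in range(n_pairs)]


def run_round(ratings, vote, rng):
    """Update ratings in place for one voting round.

    vote(a, b) is a hypothetical stand-in for the participant and returns
    1.0 (a wins), 0.0 (b wins), or 0.5 (draw), i.e. a 3AFC decision.
    """
    for a, b in sample_pairs(len(ratings), PAIRS_PER_ROUND, rng):
        e_a = expected_score(ratings[a], ratings[b])
        s_a = vote(a, b)
        ratings[a] += K * (s_a - e_a)
        ratings[b] += K * ((1.0 - s_a) - (1.0 - e_a))


if __name__ == "__main__":
    rng = random.Random(0)
    ratings = [1500.0] * N_IMAGES
    # Toy voter: always prefers the image with the lower index.
    run_round(ratings, lambda a, b: 1.0 if a < b else 0.0, rng)
    print(comb(N_IMAGES, 2))            # number of distinct image pairs
    print(max(ratings) - min(ratings))  # score range after one round
```

Because each round covers only 500 of the 499,500 possible pairs, most images are compared only a handful of times per round, so the spread of the scores grows gradually over the 16 rounds.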
5.3.2 Evaluation of the effect of human estimated scores on the rental price. This experimental setup is designed to address research question RQ1 (see Section 5.1). We join test data set P_i with the associated scores from S_c to form test data T_s1. We use test set T_s1 to estimate the target variable apartment gross rent price in a hedonic model M_t1 by including the estimated scores along with the structural apartment characteristics living area, year of construction, floor, garden, and overall condition of the apartment as predictor variables. In this way, we infer the explanatory power of different settings of the model and the marginal effects of selected predictors. We control the model for the apartments' overall condition, which permits a more conclusive inference on the effect of the estimated scores.

5.3.3 Training of ConvNet with human estimated scores. The following three experimental steps aim to answer research question RQ2 (see Section 5.1). Using the initial set of 1,000 images from D_i and the human estimated scores from S_c, we partition the data in a ratio of 65:25:10 into training set Tr_650, test set T_250, and validation set V_100, whereby the 250 test images from T_250 correspond to instances in the test dataset P_1. From Tr_650, we assign the human estimated scores as the response variable and the associated images as the predictor variable and train the ConvNet for the regression task. For training, we first crop the images to a square shape, taking the largest centered square, whose side length equals either the original height or the original width. Then the images are scaled to 224 × 224 according to the dimension of the EfficientNet-B0 input. For the training, we set the training parameters as shown in Table 1. We limit the image augmentation to horizontal flipping only, as additional augmentation showed a negative impact on the network performance. We decide to let the ConvNet converge more slowly, reducing the learning rate on plateau when the loss stops decreasing, starting from 0.001 through 0.0002 to a minimum learning rate of 0.0001. We do not apply early stopping for regularization, but select the training stage with the best performance post hoc. As model metrics for validation and training, we apply mean-squared-error (MSE) loss and root-mean-squared-error (RMSE).
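The cropping rule and the learning rate schedule described above can be sketched in plain Python. The reduction factor of 0.2 is inferred from the stated sequence 0.001, 0.0002, 0.0001, and the patience value is an assumption, since the training parameters of Table 1 are not reproduced here. The names `center_square_box` and `PlateauSchedule` are hypothetical.

```python
def center_square_box(width: int, height: int) -> tuple[int, int, int, int]:
    """Return (left, top, right, bottom) of the largest centered square crop."""
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return left, top, left + side, top + side


class PlateauSchedule:
    """Reduce the learning rate when the monitored loss stops decreasing.

    factor and min_lr follow the values in the text (0.001 -> 0.0002 -> 0.0001);
    patience is an assumed value.
    """

    def __init__(self, lr=0.001, factor=0.2, min_lr=0.0001, patience=3):
        self.lr = lr
        self.factor = factor
        self.min_lr = min_lr
        self.patience = patience
        self.best = float("inf")
        self.stale_epochs = 0

    def step(self, loss: float) -> float:
        """Record one epoch's loss and return the learning rate for the next epoch."""
        if loss < self.best:
            self.best = loss
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
            if self.stale_epochs >= self.patience:
                # Shrink the rate, but never below the configured minimum.
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.stale_epochs = 0
        return self.lr


if __name__ == "__main__":
    print(center_square_box(1600, 1200))  # (200, 0, 1400, 1200)
    schedule = PlateauSchedule(patience=1)
    for loss in [1.0, 1.0, 1.0]:
        print(schedule.step(loss))
```

The returned box uses the (left, upper, right, lower) convention that image libraries such as Pillow accept for cropping; the cropped square would then be resized to 224 × 224 before being fed to the network.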
5.3.4 Evaluation of the effect of ConvNet predicted quality scores on rental price. We evaluate the scores predicted by the ConvNet in a hedonic model. To this end, we use the trained ConvNet to predict scores for the images in the data set T_250. Subsequently, we join test data set P_i with the predicted scores to form test data T_s2. We use test set T_s2 to estimate the target variable gross rent price in a hedonic model M_t2 with exactly the same evaluation settings and the same data for the remaining features as for model M_t1.

5.3.5 Training and evaluation of ConvNet using training sets of different sizes. The final experimental step aims to evaluate to what extent the training of the ConvNet benefits from the size of the training set, i.e. from additional training data, and how this affects the price estimation. We quantify the degree of improvement using the root mean squared error (RMSE) in order to weigh the cost of creating additional annotated training data against the expected benefit; if the expected benefit of additional training data is small, it may not justify the cost of creating additional annotations. We randomly subset training set Tr_650 into two smaller training sets, Tr_450 and Tr_250. Subsequently, we train the network with training sets Tr_650, Tr_450, and Tr_250 and evaluate all three models on the test images in T_250. We then merge test data set P_i with the predicted values of the three trained networks and test the effect by applying the 2nd setting of the hedonic model M_t2.
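The two ingredients of this step, drawing smaller random training sets and scoring predictions by RMSE, can be sketched as follows. The function names are hypothetical, and whether the smaller sets are nested inside each other is not stated in the text; the sketch draws each subset independently from the full training set.

```python
import math
import random


def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


def random_subsets(train_ids, sizes, seed=0):
    """Draw one random subset of the training IDs per requested size.

    Assumption: each subset is sampled independently from the full set.
    """
    rng = random.Random(seed)
    return {size: rng.sample(train_ids, size) for size in sizes}


if __name__ == "__main__":
    subsets = random_subsets(list(range(650)), sizes=(450, 250))
    print({size: len(ids) for size, ids in subsets.items()})
    # Toy example: two human scores versus two predicted scores.
    print(rmse([1500.0, 1520.0], [1510.0, 1510.0]))  # 10.0
```

Comparing the test RMSE of the models trained on 250, 450, and 650 images gives a rough learning curve from which the benefit of further annotation can be judged.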

Results
In the following section, we first analyze the results for human estimates and their ConvNet predictions.We then evaluate their impact on the rental price in two hedonic models.

For a better understanding, the estimated values for the bathroom quality will henceforth mainly be referred to as human estimated scores.

Analyses of human estimated scores and their ConvNet predictions
In the first step, we examine selected statistics of interest for the human estimated and predicted scores. Starting from an initial Elo score (R = 1500), the min-max range and the variance of the human estimated scores increased with each subsequent voting round and, as expected, revealed a slightly double-humped data distribution. The final min-max range is 129 Elo points. The plot in Figure 5 shows a moderate positive correlation between the human estimated and predicted scores, which is also confirmed by the Pearson correlation coefficient (0.426).
It is assumed that the variables estimated scores and overall condition could have both a shared and an independent effect on the apartment rental price, which could also be observed. The Kruskal-Wallis rank sum test yields a small p-value (0.001826). Noticeable differences in central tendency between the two variables are therefore to be expected. If we incorporate a conditional smoothing function into the scatter plot (Figure 5), we observe a tendency for bathrooms with higher scores to be associated with new and fully renovated apartments. On the other hand, bathrooms with lower scores tend to be associated with apartments in less attractive condition classes such as well-kept, like-new, and partly renovated, whereby the confidence band for partly renovated apartments shows obvious uncertainties in the estimates. This is mainly due to an imbalance of the condition classes: the ratio of the cardinalities of the condition classes partly renovated and well-kept is 1:12.5. We also find the weakest correlation between the scores and the apartments' overall condition like-new. The frequency analysis for the variable year of construction and the apartment condition like-new shows a distribution of instances across all construction periods. This calls into question the relevance of the feature expression like-new.
D'Agostino's skewness test applied to the human estimated scores shows a certain skewness in the data distribution (p = 0.01216). However, a logarithmic transformation of this variable did not improve the model. The same applies to the predicted scores. The visual comparison of the images and estimated scores reflects the subjects' subjective perception of what is seen in Figure 6, which shows the same bathroom photos as in Figure 1, this time with human-estimated scores in the upper right and ConvNet-predicted scores in the lower right. These images were selected for illustration because both the qualitative differences between the interiors and the deviations between the ground truth and the predictions are obvious in them.
We also examine the change in the empirical error of the estimated bathroom scores per voting round by setting up the hedonic model M_t1 for the estimation of the log-transformed response variable gross rent. Table 2 shows the model performance after voting rounds 8, 10, 12, and 16, indicating the standard errors, the coefficients, the t- and p-values of the scores, as well as the adjusted R² of the model and the p-value of the F-test. The standard error, the p-value and the price premium of the scores, as well as the p-value of the F-test, decrease noticeably with each voting round. Conversely, the t-values and the adjusted R² increase, although the increase in the adjusted R² is not excessive in view of the low price premium attributable to the bathroom scores.

Evaluation of the effect of estimated quality scores on rental price
In this stage, we examine the extent to which the estimated scores affect the rental price of the apartments when we control for additional variables of interest. Table 3 summarizes four different settings of the hedonic model M_t1. In the table, the standard errors are given below the estimates, rounded to 3 decimal places for clarity, and the significance of the predictors can be read from the asterisks. In the 1st setting, the response variable is regressed without either of the two predictors for condition quality. In the 2nd setting, the estimated scores are added; in the 3rd setting, the model is controlled for the variable overall condition of the apartments (estimated by experts or real estate agents). The variable living area shows pronounced skewness (p = 4.331 × 10⁻¹⁶) and is therefore log-transformed in the 4th setting in the expectation of a further model improvement.
In all 4 settings, the response variable gross rent is log-transformed. The reference levels for the predictors city, overall condition and floor (not shown in the regression outputs in Tables 3 and 4) are the city of Klagenfurt, new apartment and 1st floor, respectively.
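The logic of comparing nested model settings by their adjusted R² can be illustrated with a small ordinary least squares sketch. Everything in it is an assumption made for illustration: the data are synthetic, the predictor set is reduced to living area, score and a single condition dummy, and the coefficients are invented. It does not reproduce the models in Tables 3 and 4.

```python
import numpy as np


def adjusted_r2(X: np.ndarray, y: np.ndarray) -> float:
    """Fit OLS with an intercept and return the adjusted R^2."""
    n = len(y)
    design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    p = design.shape[1] - 1  # number of predictors without the intercept
    return 1.0 - (ss_res / (n - p - 1)) / (ss_tot / (n - 1))


rng = np.random.default_rng(42)
n = 500
area = rng.lognormal(mean=4.2, sigma=0.35, size=n)  # right-skewed living area
score = rng.normal(1500, 30, size=n)                # Elo-like quality score
renovated = (rng.random(n) < 0.4).astype(float)     # simplified condition dummy

# Synthetic log gross rent with invented coefficients.
log_rent = (1.0 + 0.8 * np.log(area) + 0.001 * (score - 1500)
            + 0.08 * renovated + rng.normal(0, 0.1, size=n))

settings = {
    "1 baseline": np.column_stack([area]),
    "2 + scores": np.column_stack([area, score]),
    "3 + condition": np.column_stack([area, score, renovated]),
    "4 log(area)": np.column_stack([np.log(area), score, renovated]),
}
fit = {name: adjusted_r2(X, log_rent) for name, X in settings.items()}
for name, value in fit.items():
    print(f"{name:>14}: adjusted R^2 = {value:.4f}")
```

The adjusted R² penalizes each added predictor, so it only rises when the new variable explains more than chance would. In this synthetic setup the log transformation of the skewed living area gives the largest single gain because the data were generated from a log-linear relationship.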
In the model M_t1, the values of the variable bathroom scores in the 2nd setting indicate a significant effect on the response variable. As already stated in Section 6.1, it cannot necessarily be inferred from the condition of an individual room that the entire apartment is in an equivalent general condition, as was previously shown using the Kruskal-Wallis rank sum test. Analogously, a strong relationship between overall condition and the response variable was found using the same statistical test, and a moderate correlation of 0.359 between the estimated scores and the response variable was also observed. Further support for the effect of the estimated scores on rental price is found in the 3rd model setting, in which we control for the condition of the apartment. If we include solely the condition of the apartment without the scores, the goodness of fit of the model improves to 0.8237 compared to the baseline setting and, as expected, the condition shows a stronger effect than the estimated scores in the 2nd setting. However, this effect is considerably smaller in the 3rd setting, where the estimated scores are included. In addition, the scores remain highly significant in the 3rd setting. This confirms that there is additional informational content in the images that our human respondents were able to extract. In addition to these effects of central concern, a further relevant observation can be made based on the 4th experimental setting of the model. The adjusted coefficient of determination in the 4th setting shows a notable improvement in the model's goodness of fit due to the effect of the log-transformed living area. Since the variable year of construction shows a non-linearity, as expected, the application of orthogonal polynomials is superfluous, whereby the significance of the variable has increased noticeably. Also in this 4th setting, the significance of the scores remains high. Based on the present results, it can be confirmed that the human estimated scores have a notable influence on the response variable. An in-depth analysis of these results is provided in Section 7.

Evaluation of the effect of predicted scores on the rental price
In this subsection, we examine the effect on the rental price of the bathroom quality scores predicted by the ConvNet. Table 4 shows the four settings of the model M_t2, which includes the network's predictions in place of the human estimated scores. Except for this key change, the data and the variable setup are exactly the same as in model M_t1. Integrating the ConvNet-predicted scores into the model (2nd setting) improves the goodness of fit of the model by only 0.9%. We note the high significance of the predicted scores in this setting. Incorporating the variable general condition of the apartment in the 3rd setting increases the goodness of fit by 1.4%; the predicted scores remain significant, but their p-value rises clearly from 0.000664 to 0.00889. We observe the same effect in the 4th setting when we include the log-transformed living area: the predicted values remain in the significant range, though with reduced significance. In addition, the goodness of fit in settings 3 and 4 of model M_t2 is slightly lower than in model M_t1 due to the weaker effect of the predicted scores. This confirms the greater explanatory power of the human-estimated scores compared with the network predictions. Finally, we note that the correlation between the predicted scores and the response variable (0.243) is weaker than that between the human estimates and the response variable (0.358). The results show that the generalizability of the network decreases with each smaller dataset used for training. The RMSE, for instance, shows a decrease in prediction error of 9.5 and 4.8 for the models with training sets Tr_450 and Tr_650. A slight improvement in the goodness of fit of M_t2 in the 2nd setting is also noted when more data are used for training.

Discussions

Limitations of the study
There are several limitations to this study that should be acknowledged.
(1) The sample size for the experiments is not large. The amount of test and training data for the ConvNet, as well as the data for the hedonic models, depends predominantly on the overall number of voted images. We had to design the subjective comparative experiments in such a way that a voting round would not be excessively long and would not strongly affect the validity of the responses.
(2) There are no reference data sets in the scientific community that could be used for the problem at hand.
(3) Lastly, there are no research studies on this topic within the real estate domain with which our results could be compared.
Despite the limitations mentioned, this study provides valuable insights into the potential benefits of the approach presented. The results support the need for further research to confirm and extend these findings and to address the limitations of this study.

Interpretation of results and key findings
There is a large amount of available multimodal, semi-structured and unstructured information that can be incorporated into AVMs and can thus provide deep and comprehensive insights to appraisers and help them draw further conclusions in price estimation. By integrating alternative data extraction approaches and using Machine Learning to process larger amounts of data, modeling difficulties can be significantly reduced and data from new sources and modalities can be obtained (Lahat et al., 2015). As for the problem at hand, the usefulness of human-based qualitative estimation of individual apartment rooms and the transfer of the estimates to a Machine Learning model for prediction on unknown data is justified if an effect of the human estimates and predictions on property price can be confirmed within a potentially applicable model.
The data obtained in this study through human subjective judgments demonstrate the desired effect in two ways: (1) the empirical error in quality estimation is minimized by repeated interpretation of visual semantic cues by multiple subjects (Table 2), and (2) the human-estimated quality scores showed a well-correlated relationship with, and a discernible stable effect on, the response variable in the hedonic model (Table 3). This is also validated in the model settings in which we control for the overall condition of the apartment. This confirms that, despite the limitations mentioned, our methodology for eliciting and computing qualitative ratings (i.e. the Elo scores) provided a sound approximation of subjects' affective responses.
We tried to answer the question of to what extent it is possible to generate synthetic subjective quality judgments, i.e. to annotate new data automatically using information gained from the previous experiment. Here we observed the following: (1) the price premium of the scores in both hedonic models is small, which is to be expected (Tables 3 and 4); (2) the predicted scores in the second model are within the significant range in all settings (Table 4); and (3) the predicted scores in the second model lose significance once stronger marginal effects of additional predictors occur. However, the observed results indicate that the network can detect additional information content in the images, that is, it demonstrates a recognizable approximation. The loss of variable significance and goodness of fit in M_t2 reinforces the assumption that a larger amount of training data should improve the approximation power of the network and thus the effect of the predicted scores on the response variable in the model.
The values for the goodness of fit of the M_t2 model in Table 5 are attributable to the use of more training data. The results regarding the generalizability of the neural network confirm that additional training data contribute to higher prediction accuracy. These results are consistent with previous findings on computer vision methods in the literature (Luo et al., 2018; Li et al., 2016), according to which network architectures benefit from additional training data and thus generalize better to independent test data.

Conclusions
We presented a study on incorporating repeated subjective human judgments about bathroom quality, based on visual information, into AVMs. We have also tested the robustness and generalization capability of a popular convolutional network architecture on a regression task that predicts the estimated human judgments. We would like to point out that our study is intentionally open-ended and primarily designed to establish the potential applicability of the approach, i.e., the results were obtained without specific optimization of any of the methods and thus serve to establish a baseline. Our goal was to show how coherently the presented methods work on the existing dataset in the featured scenario without further tuning them and thereby potentially overfitting them to the data. With this study and our findings, we want to extend the existing information extraction methods for automated valuation models, which in turn would contribute to greater transparency of valuation procedures and thus to more reliable statements about the value of real estate. We see several promising directions for extending our study in the future: (1) the possibility of processing more information by controlling the number of inputs for rating, (2) the applicability of the proposed approach in different scenarios (rooms, or location quality), and (3) combining visual data from additional contexts (e.g. floor plans) to extend the information content and thus improve the generalization of the estimates.
Figure 1. Exemplary representation of new kitchens (top) and bathrooms (bottom)
Figure 3. Simplified representation of EfficientNet-B0 architecture with modified top layers (outlined in the figure with dotted lines) for regression task
Figure 4. Graphical user interface of the voting application
Figure 5. The scatter plot shows the relationship between the human estimated and predicted values aggregated across the classes of the overall condition of the apartment
Figure 6. Images of bathrooms shown previously in Figure 1 with human estimated quality scores (top-right) and ConvNet predicted scores (bottom-right)

In the last step, we elaborate on the results on ConvNet training performance with respect to training data of different sizes. Section 7 discusses the results in relation to the research questions.

Regression and evaluation of ConvNet using training data of different size
With the following experiments, we want to evaluate how important the size of the training set is for the task of score prediction and whether the evaluated network architecture can effectively benefit from additional training data. Knowing the impact of additional training data is important in practice, as it helps to weigh the cost of creating additional labeled training data against the expected benefit. Thus, with the help of this experiment, we investigate the explanatory power of the network using training data of different sizes and infer in Section 7 whether this can eventually impact, in practice, the effect of the predicted scores in an applicable hedonic model. To answer this question, we train the ConvNet on training sets of different sizes, Tr_650, Tr_450, and Tr_250, to see whether additional training data improve test performance. The evaluation is always performed on the same test set, i.e. for all training partitions Tr_650, Tr_450, and Tr_250 there is one test set T_250, which contains unseen independent data. Quantitative results are summarized in Table 5. To investigate the impact of the dataset size on score extraction performance, we always train EfficientNet-B0 using the same pre-trained weights, first with the largest training set Tr_650 and then with the two smaller training sets.
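The evaluation protocol described here, several training sets of decreasing size scored against one fixed test set, can be sketched independently of the network. The example below is only a stand-in: it uses a closed-form ridge regression on synthetic features instead of EfficientNet-B0, and it assumes that the smaller training sets are nested subsets of the largest one, which this section does not state. All names and sizes of the synthetic data are hypothetical.

```python
import numpy as np


def fit_ridge(X: np.ndarray, y: np.ndarray, lam: float = 1.0) -> np.ndarray:
    """Closed-form ridge regression without intercept (data are zero-mean)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)


def rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


rng = np.random.default_rng(7)
n_total, n_features = 900, 200

# Synthetic stand-in for image features and human estimated scores.
X = rng.normal(size=(n_total, n_features))
true_w = rng.normal(size=n_features)
y = X @ true_w + rng.normal(0, 5.0, size=n_total)

# One fixed, unseen test set of 250 samples; the other 650 form the pool.
perm = rng.permutation(n_total)
test_idx, pool_idx = perm[:250], perm[250:]

results = {}
for size in (650, 450, 250):
    train_idx = pool_idx[:size]  # assumed nested subsets of the pool
    w = fit_ridge(X[train_idx], y[train_idx])
    results[size] = rmse(y[test_idx], X[test_idx] @ w)

for size, err in results.items():
    print(f"training size {size}: test RMSE = {err:.2f}")
```

Holding the test set fixed means that differences in RMSE can be attributed to the training set size rather than to a different evaluation sample, which is what makes the three runs comparable.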