Modeling geo-homophily in online social networks for population distribution projection

Purpose – Projecting the population distribution in geographical regions is important for many applications, such as launching marketing campaigns or enhancing public safety in densely populated areas. Conventional studies require the collection of people's trajectory data through offline means, which is limited in terms of cost and data availability. The wide use of online social network (OSN) apps on smartphones has created an opportunity to devise a lightweight approach that conducts the study using the online data of smartphone apps. This paper aims to reveal the relationship between online social networks and offline communities, as well as to project the population distribution by modeling geo-homophily in online social networks.

Design/methodology/approach – In this paper, the authors propose the concept of geo-homophily in OSNs to determine how much the data of an OSN can help project the population distribution in a given division of geographical regions. Specifically, the authors establish a three-layered theoretic framework that first maps the online message diffusion among friends in the OSN to the offline population distribution over a given division of regions via a Dirichlet process and then projects the floating population across the regions.

© Yuanxing Zhang, Zhuqi Li, Kaigui Bian, Yichong Bai, Zhi Yang and Xiaoming Li. Published in the International Journal of Crowd Science. Published by Emerald Publishing Limited. This article is published under the Creative Commons Attribution (CC BY 4.0) licence. Anyone may reproduce, distribute, translate and create derivative works of this article (for both commercial and noncommercial purposes), subject to full attribution to the original publication and authors. The full terms of this licence may be seen at http://creativecommons.org/licences/by/4.0/legalcode

This work was supported by the National Natural Science Foundation of China under grant numbers 61572051 and 61632017 and by the National 973 Grant under grant number 2014CB340405.


Introduction
The study of population distribution in fixed geographical regions (e.g. states, provinces) is of paramount importance for governments to enhance public safety in places with a large floating population (FP) and for businesses to launch marketing campaigns in densely populated areas (Harris and Todaro, 1970). Data relevant to people's trajectories are conventionally collected through offline sources. For instance, it is feasible to predict the FP in transportation systems by analyzing the origin-destination data of passengers (Chi, 2010); customers' bank notes can be used to model human trajectories as a continuous-time random walk (Gonzalez et al., 2008). The government (e.g. the statistical bureau) collects demographic data to investigate the correlation between human migration patterns and geographic labor demand and supply (National Bureau of Statistics of China, 2010). The challenges of conducting these studies stem from the high cost of the data collection methods (in time, manpower and money) and from restricted access to such data sets due to security and privacy concerns.
The wide use of online social network (OSN) apps on smartphones has accumulated a rich set of geographical data that describe anonymous user trajectories (Kido et al., 2005) and habits (Gao et al., 2017) in the physical world, which holds the promise of providing a lightweight means to study the population distribution (Li et al., 2016). For example, many OSN applications, such as Facebook and Weibo, allow users to "check in" and explicitly show their locations (Guo et al., 2013; Nazir et al., 2008; Liao et al., 2015); other applications implicitly record users' geo-related information, such as GPS coordinates and IP addresses (Backstrom et al., 2010; Zheng et al., 2010). Existing research has shown the feasibility of using OSN data to predict users' offline locations as well as their mobility patterns (Cho et al., 2011; Li et al., 2010). Moreover, the online relationship between friends can affect their social ties in the physical world (Zheng, 2012): "close" friends in the OSN are also physically close to each other (Cho et al., 2011).
However, it remains unclear which types of OSNs can help determine the population distribution in given geographical regions. Intuitively, there are two observations:
(1) It is easy to draw a population distribution over geographical regions that are stable, i.e. regions in which most people do not travel far.
(2) Acquaintances in the same geographical region have a strong desire to communicate with each other through the OSN (Girvan and Newman, 2002).
Therefore, we seek to answer the following questions in this study: Is there a way of measuring the stability of geographical regions by observing the online message diffusion among people in those regions? How can the offline population distribution over a stable division of regions be derived? Given a population distribution, how can the FP across regions be projected?
Our research findings indicate that a division of geographical regions is stable only if the OSN users in these divided regions show a strong geo-homophily, i.e. people in each region prefer communicating with others in the same region more than with those in other regions, and that the Dirichlet process (DP) (Neal, 2000) provides a viable way of modeling the distribution of OSN users across offline regions. These findings inspire us to investigate the relationship between online information diffusion, i.e. users' communication in the OSN, and the population distribution over a fixed division of offline regions.
In this paper, we present a systematic approach that projects the offline population distribution in fixed geographical regions by modeling the geo-homophily of OSNs. Specifically, we establish a three-layered theoretic framework that first maps the online message diffusion among friends in the OSN to the offline population distribution via a DP and then projects the FP across geographical regions given the derived population distribution. The contributions of this work are summarized as follows:
Connecting online data to stability of geographical regions: We establish the correlation between online message diffusion and the stability of geographical regions by modeling the geo-homophily of an OSN with geographical attributes. We derive the condition for a division of geographical regions to have a non-decreasing stability.
DP-based prediction models: We formulate the population distribution problem from the perspective of DP and present a theoretical framework to project the population distribution over fixed geographical regions by casting online message diffusion into the established framework. Based on the derived population distribution, we propose a prediction model that utilizes the message diffusion graph in OSNs to infer the FP across geographical regions.
Experiments using large real-world data sets: Through experiments over real-world data sets, we validate the efficacy of the model in projecting the population distribution over fixed regions and show that the proposed prediction models achieve a high prediction accuracy in characterizing how the FP changes across regions upon the occurrence of societal events (the mass human migration caused by the Chinese Spring Festival 2016).
The rest of this paper is organized as follows. We introduce the related work and technical background on DP in Section 2. We provide the system model and formulate the problem in Section 3. In Sections 4 and 5, we show the approach of projecting the population distribution and present the model of predicting the FP, respectively. We validate the proposed model by experiments in Section 6 and conclude the paper in Section 7.

Related work

2.1 Geographical views of online social networks
Prior work on geographical aspects of OSNs has mostly focused on prediction and analytics of various properties in OSN by leveraging the location-related information.
2.1.1 Predicting mobility patterns using online social network data. Users' locations can be predicted by mining their periodic behaviors in the social network, given that the observed movement is associated with certain reference locations (Li et al., 2010). Cho et al. (2011) show that human movement and mobility patterns have a high degree of freedom and variation but still exhibit structural patterns due to geographical and social constraints, on the basis of two observations: (1) short-ranged travel is periodic both spatially and temporally and not affected by the social network structure; and (2) long-distance travel is more influenced by social network ties.

Thus, the historic data can be used to predict where a user might travel.
2.1.2 Data dissemination in a geographical perspective. Wang et al. (2014) propose a three-layered architecture to model data dissemination in OSNs, present a density function of the general social relationship distribution and derive tight lower bounds on the traffic load of data dissemination in OSNs under the assumption that every source sustains a data generating rate of a constant order.
2.1.3 Online and offline social behaviors. Zheng (2012) proposes a location-based social network (LBSN), which consists of the new social structure made up of individuals connected by the "interdependency" derived from their locations in the physical world as well as their location-tagged media content, such as photos, video and texts. Hristova et al. (2014) experimented on a data set with 74 college students as volunteers, observing evidence of homophily with regard to many factors within the online and offline social networks. They found that the social tie among students at the same educational institution was strongly affected by residential sector and year in college but exhibited diversity in other online aspects, leading to the conclusion that diversity online is related to diversity offline.
2.1.4 Social tie inference. Sociological phenomena can also be observed within OSNs. Although OSN platforms have facilitated people's communication, the volume of communication between OSN friends (the strength of the social tie between them) is inversely proportional to the geographical distance, following a power law (Goldenberg and Levy, 2009). Considering co-occurrence in time and space, Crandall et al. (2010) present a probabilistic model showing that even a very small number of co-occurrences can result in a high empirical likelihood that two people know each other, i.e. that a social tie exists between them, which suggests a way to infer the social network structure solely by capturing individuals' physical locations over time.

Dirichlet distribution and Dirichlet process
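The Dirichlet distribution is the conjugate prior of the multinomial distribution and can be seen as a distribution over distributions. For $x = (x_1, \dots, x_K)$ with $x_i \ge 0$ and $\sum_i x_i = 1$, the probability density function is written as:

$$f(x_1, \dots, x_K; \alpha_1, \dots, \alpha_K) = \frac{\Gamma\left(\sum_{i} \alpha_i\right)}{\prod_{i} \Gamma(\alpha_i)} \prod_{i=1}^{K} x_i^{\alpha_i - 1}.$$

There are two parameters:
(1) The scale $\alpha = \sum_i \alpha_i$: a small scale $\alpha$ favors extreme distributions, but this prior belief is very weak and is easily overwritten by data, whereas an extremely large $\alpha$ makes the samples more consistent with the base measure.
(2) The base measure $(\alpha_1/\alpha, \dots, \alpha_K/\alpha)$, which is the mean of the distribution.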
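The effect of the scale parameter can be checked numerically. The following is a minimal sketch, not part of the original study: it draws Dirichlet samples by normalizing independent Gamma draws (a standard construction) and compares the average size of the largest component for a small and a large scale. The base measure, the two scale values and the function names are illustrative assumptions.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw one sample from Dir(alphas) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def mean_max_component(scale, base, n_samples=2000, seed=0):
    """Average size of the largest component over many Dirichlet samples."""
    rng = random.Random(seed)
    alphas = [scale * b for b in base]  # alpha_i = scale * base_i
    total = sum(max(sample_dirichlet(alphas, rng)) for _ in range(n_samples))
    return total / n_samples

# Hypothetical base measure over three regions (illustrative values only).
base_measure = [0.5, 0.3, 0.2]
small = mean_max_component(0.5, base_measure)     # small scale: samples are near-degenerate
large = mean_max_component(1000.0, base_measure)  # large scale: samples hug the base measure
print(f"mean largest component, small scale: {small:.3f}")
print(f"mean largest component, large scale: {large:.3f}")
```

With the small scale, most of the mass of each sample tends to sit on a single component, whereas with the large scale the largest component stays close to the largest entry of the base measure.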
One popular application of the Dirichlet distribution is latent Dirichlet allocation for topic discovery in natural language processing. It is a generative statistical model that describes sets of observations through unobserved groups, which explain why some parts of the data are similar.
DP is a class of Bayesian nonparametric models that generalizes the Dirichlet distribution (Neal, 2000). DP is a distribution over a space with a countably infinite number of elements, which also requires a scale parameter $\alpha$ and a base measure $G_0$, denoted as $\mathrm{DP}(\alpha, G_0)$. DP is an important method in Bayesian inference for identifying the prior distribution of random variables, and it is widely used for density estimation, semiparametric modeling and sidestepping model selection/averaging. One important implication is that DP helps find the number of active components, which is much smaller than the number of samples. In this paper, we investigate how to use DP to model the process by which OSN users are distributed into geographical regions.
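The remark that the number of active components is much smaller than the number of samples can be illustrated with the Chinese restaurant process, one common representation of DP. The sketch below is an illustration under assumed values (5,000 samples, scale 2.0), not the authors' implementation; the function name is hypothetical.

```python
import random

def chinese_restaurant_process(n_customers, alpha, seed=0):
    """Seat customers one by one; return the list of table sizes.

    Customer n (0-based) opens a new table with probability alpha / (n + alpha)
    and joins existing table k with probability size_k / (n + alpha).
    """
    rng = random.Random(seed)
    table_sizes = []
    for n in range(n_customers):
        threshold = rng.random() * (n + alpha)
        if threshold < alpha:
            table_sizes.append(1)  # open a new table (new active component)
            continue
        threshold -= alpha
        for k, size in enumerate(table_sizes):
            if threshold < size:
                table_sizes[k] += 1
                break
            threshold -= size
        else:
            table_sizes[-1] += 1  # guard against floating-point round-off
    return table_sizes

sizes = chinese_restaurant_process(5000, alpha=2.0)
print(len(sizes), "active components for", sum(sizes), "samples")
```

The number of tables grows roughly logarithmically with the number of customers, which is why a few components suffice for a large sample.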

System model
In this section, we propose a three-layer framework that analyzes the message diffusions in the OSN to determine the stability of geographical regions. This problem is equivalent to determining whether the OSN has a strong geo-homophily or, more specifically, whether the structure of the message diffusion graph is similar to that of the divided regions. We extend the concept of modularity (Newman, 2004) to quantify the degree of geo-homophily of an OSN, and we specify the condition on the geo-homophily of an OSN under which the stability of the underlying geographical regions remains non-decreasing.
In Figure 1, we show a three-layered framework consisting of: Layer 1 that captures the message diffusion graph in an OSN; Layer 2 that seeks to derive the user population distribution from the geo-location of OSN users in Layer 1; and Layer 3 that predicts how the FP will change given the distribution derived in Layer 2.
From the top to the bottom layer, we first investigate how messages diffuse among groups of people that have similar geo-locations. If people in the same geographical region communicate frequently, it is highly likely that the structure of the message diffusion graph is similar to that of the underlying division of regions, i.e. a strong geo-homophily exists between the OSN and the offline regions. As a result, we can use the geo-location of messages among OSN users to derive the user distribution over the given regions. Then, the FP across regions can be further inferred based on the derived distribution.

Geo-homophily of an online social network over divided regions
We define the geo-homophily of an OSN as the degree of similarity between the structure of the message diffusion graph in the OSN and that of a given division of regions.
We calculate modularity to quantify the geo-homophily of an OSN. Consider the message diffusion graph of an OSN, $G = (V, R, E, T)$, where $V$ denotes the set of users, $R$ denotes the set (or division) of regions and $E$ denotes the set of edges $e_{uv}$ with weights $r_{uvT}$. The weight $r_{uvT}$ represents the number of views by user $u$ of the content shared by user $v$ during time period $T$; the edge $e_{uv}$ exists if this number is non-zero. Let $X_T$ be the total number of views during $T$:

$$X_T = \sum_{u, v \in V} r_{uvT}.$$

We can easily transform $E$ from a user-to-user perspective to a region-to-region one, recorded as $\mathcal{E}$, where each $e_{ij} \in \mathcal{E}$ has value

$$e_{ijT} = \sum_{u \in i,\, v \in j} r_{uvT},$$

which represents the number of views from nodes in region $i$ to the content of nodes in region $j$ during time period $T$. Let $p_{ijT}$ be the proportion of messages from $i$ to $j$ during $T$, namely,

$$p_{ijT} = \frac{e_{ijT}}{X_T}.$$

To quantify the geo-homophily of an OSN $G = (V, R, E, T)$, we define the modularity on $R$ during $T$, denoted $Q_{RT}$ (equation (1)). It reflects how strongly messages are concentrated within the same regions. $Q_{RT}$ ranges in $[-1, 1]$: it reaches 1 if all diffusions take place inside the same regions and reaches $-1$ when no messages are transmitted between users from the same region. The greater $Q_{RT}$ is, the higher the geo-homophily of the OSN.
If an OSN shows a strong geo-homophily over the divided regions, most OSN users prefer communicating with others in the same region rather than with those in other regions, which implies that each user is more attached to his/her current region than to other regions, thereby leading to a high stability of each region. Next, we show how to determine the change in the stability of a division $R$ by imposing a condition on $Q_{RT}$.
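The user-to-region aggregation described above can be sketched in a few lines. This is a hedged illustration, not the authors' code: the input dictionaries, the toy users and regions, and the function names are assumptions. The exact form of equation (1) is not reproduced here; `signed_score` is only a hypothetical stand-in that matches the stated range (1 when all views are in-region, −1 when none are).

```python
from collections import defaultdict

def region_proportions(views, region_of):
    """Aggregate user-to-user view counts into region-to-region proportions.

    views: {(u, v): number of views by u of content shared by v}
    region_of: {user: region}
    """
    region_views = defaultdict(int)
    for (u, v), count in views.items():
        region_views[(region_of[u], region_of[v])] += count
    total = sum(region_views.values())  # total number of views in the period
    return {pair: count / total for pair, count in region_views.items()}

def in_region_share(p):
    """Share of all views that stay inside a single region."""
    return sum(share for (i, j), share in p.items() if i == j)

def signed_score(p):
    """Hypothetical stand-in for Q_RT: in-region share minus cross-region share."""
    inside = in_region_share(p)
    return inside - (1.0 - inside)

# Toy data (invented for illustration): users a, b live in X; c, d live in Y.
views = {("a", "b"): 3, ("b", "a"): 1, ("a", "c"): 1, ("c", "d"): 5}
region_of = {"a": "X", "b": "X", "c": "Y", "d": "Y"}
p = region_proportions(views, region_of)
print(p, in_region_share(p), signed_score(p))
```

In this toy example nine of the ten views stay inside a region, so the in-region share is 0.9.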

Stability of a division of regions
The modularity quantifies the geo-homophily between an OSN and the underlying geographical regions. However, it is infeasible to foresee whether the regions will remain stable because the structure of the message diffusion graph changes dynamically. For instance, breaking news may reshape the structure of the message diffusion graph and push people to move across regions, which may undermine the stability of the division. Next, we deduce the condition under which the stability of a division of regions remains non-decreasing.
Formally, given two time periods $T = [t_0, t_1]$ and $T' = [t_0, t_2]$, where $t_2 > t_1$, we need to find the distribution of messages in period $[t_1, t_2]$ that leads to an equal or higher modularity at the end of $T'$, i.e. $Q_{RT} \le Q_{RT'}$. We define the social entropy of message diffusion inside and outside regions in the message diffusion graph $G$, denoted $H(G)$. As the redistribution of message diffusion inside each region does not affect the modularity according to equation (1), we focus only on the message diffusions (edges) across regions.
Hence, we combine all edges within a region into a new set, $I_T = \bigcup_{i \in R} e_{iiT}$, with which $H(G)$ can be rewritten. New message diffusions in time period $[t_1, t_2]$ create new edges and construct a new message diffusion graph $G'$ (which can be extended from $G$). Let $l_{ij}$ be the number of new edges from region $i$ to $j$ in $G'$ that are not included in $G$. Note that $\forall i, j \in [1, |R|]$, $l_{ij} \ge 0$. Let $\mathbf{L}_{|R| \times |R|}$ be the matrix of $l_{ij}$, and let $L = \sum_{i,j} l_{ij}$, where $L \ll X_T$. Let $l_I = \sum_{i \in R} l_{ii}$ be the number of new edges inside regions.
To measure the impact of, and the change to, $G$ caused by the new message diffusion $L$, we define the Information Increment, $\Gamma(G, L)$. According to equation (3), the social entropy changes accordingly. The following proposition prescribes the condition for the stability of divided regions to remain non-decreasing, based on the analysis of the OSN message diffusion graph.
P1. Given a message diffusion graph $G$ over a division of regions, the geo-homophily will not decrease if $\Gamma(G, L)$ is no smaller than X L T, where $L \ll X_T$.
Proof. The degree of geo-homophily of an OSN will not decrease if the social entropy has no tendency to increase, i.e. $\Delta H$ is non-positive. Substituting $\Gamma(G, L) \ge$ X L T into equation (6) then yields the proposition. □

Population distribution projection
Given a division of regions, the geo-homophily is an indicator of the similarity between the structure of the OSN message diffusion graph and that of the division. The stronger the geo-homophily of an OSN, the more communication occurs between friends in the same region rather than across regions. Whenever a new user joins the OSN, he/she is highly likely to be assigned to the region where most of his/her friends reside. This is similar to the Chinese restaurant process (one representation of DP), which describes how guests are assigned to different tables in a restaurant according to the existing guest distribution.
In this section, we present a Bayesian nonparametric model based on the DP, which predicts how users in an OSN with strong geo-homophily are distributed over a given division of regions. In contrast, a weak geo-homophily in the OSN over the given regions fails to establish the link between OSN message diffusion and the user distribution, which leads to a low prediction accuracy.

User distribution model
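We propose a user distribution model (UDM) on the basis of the Dirichlet process mixture (DPM) model for learning the hyper-parameters of the gathering mode, which is defined as a distribution of a random probability measure $\theta$. A UDM has two parameters: the base distribution $\theta_0$, which is considered the mean of the DP, and the scale parameter $\alpha$, which acts like an inverse variance of the DP. Then we have $\theta \sim \mathrm{DP}(\alpha, \theta_0)$, representing a draw of a random probability measure $\theta$ over a given parameter space $\Theta$ from the corresponding DP. For every user $u \in V$, we can draw a relevant $\theta_u$ from $\theta$. Here, $\alpha$ affects the probability that $\theta_u = \theta_v$, $u \ne v$. Thus, sampling from the UDM is executed by the following generative process:

$$\theta \sim \mathrm{DP}(\alpha, \theta_0), \qquad \theta_u \mid \theta \sim \theta, \qquad r_u \mid \theta_u \sim F(\theta_u),$$

where $F$ is the likelihood function determining which region user $u$ belongs to. Due to the clustering property, the number of distinct $\theta_u$'s is exactly $|R|$, far less than $|V|$. Let $\tilde{\theta}_r$, $r \in R$, be the non-redundant hyper-parameters.
We have $\theta$ in $|R|$ dimensions, where $\sum_{r \in R} \alpha_r = \alpha$, i.e. $\Theta \sim \mathrm{Dir}(\{\alpha_r\}_{r \in R})$. Define $n_r$ as the number of users $u$ whose $r_u$ equals $r$. We can then deduce the posterior distribution as:

$$\Theta \mid \{r_u\}_{u \in V} \sim \mathrm{Dir}(\{\alpha_r + n_r\}_{r \in R}).$$

Thus, the marginal probability is:

$$p(\{r_u\}_{u \in V}) = \frac{\Gamma(\alpha)}{\Gamma(\alpha + |V|)} \prod_{r \in R} \frac{\Gamma(\alpha_r + n_r)}{\Gamma(\alpha_r)}.$$

According to Bayesian theory, for a user $u \notin V$, the predictive distribution becomes:

$$p(r_u = r \mid \{r_v\}_{v \in V}) = \frac{\alpha_r + n_r}{\alpha + |V|}.$$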
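For a finite set of regions with a Dirichlet prior, the predictive probability that a new user falls into region $r$ is the standard Dirichlet-multinomial result $(\alpha_r + n_r)/(\alpha + |V|)$. The sketch below simply evaluates that ratio; the region names, counts and per-region scale values are invented for illustration and are not taken from the data sets in this paper.

```python
def predictive_distribution(counts, alphas):
    """Posterior predictive region probabilities for a new user.

    counts: {region: n_r}, number of existing users per region
    alphas: {region: alpha_r}, per-region scale parameters (prior pseudo-counts)
    """
    total = sum(counts.get(r, 0) for r in alphas) + sum(alphas.values())
    return {r: (counts.get(r, 0) + alphas[r]) / total for r in alphas}

# Hypothetical counts and a symmetric prior (assumed values).
counts = {"A": 60, "B": 30, "C": 10}
alphas = {"A": 1.0, "B": 1.0, "C": 1.0}
probs = predictive_distribution(counts, alphas)
print(probs)
```

A region that already hosts many users attracts a new user with higher probability, which mirrors the "rich get richer" behavior of the Chinese restaurant process.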

A special case of Chinese restaurant process
The process of distributing users over multiple regions is a special case of the Chinese restaurant process (Aldous, 1993), given that $|R|$ is finite. Whenever a new user joins the OSN, he/she needs to choose a region to stay in by considering the distribution of his/her friends in the given regions:
When the OSN has a strong geo-homophily over the regions, people prefer to communicate and stay with their friends in the same region.
When the OSN has a weak geo-homophily, users may communicate with online friends in one region but stay with offline acquaintances in a different region.

Parameters in the view of stick-breaking representation.
Although the $n_r$'s are statistics that can be obtained directly, the scale parameters are not easy to compute. To avoid manual assignment of $\alpha_r$, we change our view of the problem to an equivalent one, i.e. the stick-breaking representation.
The posterior distribution of $\theta$ over $\tilde{\theta}$ can be deduced, where $\delta_{\tilde{\theta}}$ is a probability measure concentrated at $\tilde{\theta}$. Considering a partition $(\theta_0, \Theta \setminus \theta_0)$ and serializing the regions from 1 to $|R|$, the stick-breaking procedure is obtained with $\beta_i \sim \mathrm{Beta}(1, \alpha)$ for $i \ne |R|$ and $\beta_{|R|} = 1$. The posterior distribution of each $\beta_i$ is then derived accordingly.
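A truncated stick-breaking draw over a finite number of regions can be sketched as follows. The weight formula used here, $\pi_i = \beta_i \prod_{j<i} (1 - \beta_j)$, is the standard stick-breaking construction and is an assumption about the omitted equation; the number of regions and the scale value are illustrative.

```python
import random

def truncated_stick_breaking(n_regions, alpha, seed=0):
    """Break a unit stick into n_regions pieces.

    beta_i ~ Beta(1, alpha) for every region but the last; the last region
    takes beta = 1, i.e. whatever is left of the stick.
    """
    rng = random.Random(seed)
    weights = []
    remaining = 1.0  # length of stick not yet assigned
    for i in range(n_regions):
        is_last = i == n_regions - 1
        beta = 1.0 if is_last else rng.betavariate(1.0, alpha)
        weights.append(remaining * beta)
        remaining *= 1.0 - beta
    return weights

weights = truncated_stick_breaking(n_regions=5, alpha=2.0)
print(weights, sum(weights))
```

Setting the last fraction to 1 guarantees that the weights sum to one, so they form a valid distribution over the regions.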

Floating population inference
In the physical world, people may move across regions periodically or temporarily, thereby greatly influencing the geo-homophily of the OSN they use. In general, there are two important regions for every person, that is, the home region denoted as h, and the remote region denoted as ? (e.g. the workplace). According to a previous study (Cho et al., 2011), most message diffusions occur in or between these two regions (e.g. an OSN user in the remote region contacts family members in the home region or colleagues in the same remote region). With these observations, we leverage the geo-attributes of message diffusions between the sender and receiver to infer the distribution of the FP.

Distribution of message diffusions
We use a tetrad $S = (C, Q, l, x)$ to represent the state of the message diffusion graph. Consider a state in which the population distribution is captured as $C = \{c_i\}_{i \in R}$, where $c_i$ represents the proportion of the population of region $i$.
Denote the real population distribution as $Q = \{q_{ij}\}_{i, j \in R}$, where $q_{ij}$ is the proportion of people whose remote region is region $j$ and whose home region is region $i$. We have $c_i = \sum_{j \in R} q_{ji}$. Let $s_{ij}$ be the proportion of users in $j$ with home region $i$, i.e. $s_{ij} = q_{ij} / c_j$. Similar to the UDM, the $s_{ij}$'s in a specific region can also be generated from a DP. Given a sender region, the amount of region-to-region communication is proportional to the population of the receiver region. For every receiver region $r$, the resulting message distribution involves two parameters: $l$ is the proportion of communications with the home region, and $x$ is the proportion of communications with the remote region.
State difference. Define a baseline state $S^0$, where all people stay in their home regions, i.e. $\forall i, j \in R$, $i \ne j$, the corresponding $s^0_{ij} = 0$. Consider the difference between an arbitrary state $S$ and the baseline state, named the state difference $\Delta S$.
P2. The state difference follows a superposition of a uniform distribution and a Dirichlet distribution.
Proof. Considering the proportion of messages from r ? to r t, we can deduce that the state difference is a constant plus a variable generated from a DP. It indicates that the state difference follows a superposition of a uniform distribution and a Dirichlet distribution. □
This proposition suggests inferring the FP by divide and conquer. The state difference reduces the weight of the uniform distribution component.
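The two identities stated above, $c_i = \sum_j q_{ji}$ and $s_{ij} = q_{ij}/c_j$, can be checked on a toy example. The sketch below is illustrative only: the nested-dictionary layout (`q[home][remote]`), the two regions and their values are assumptions, not data from the paper.

```python
def region_population(q):
    """c_i = sum_j q[j][i]: share of all people currently located in region i."""
    regions = list(q)
    return {i: sum(q[j][i] for j in regions) for i in regions}

def home_composition(q):
    """s[i][j] = q[i][j] / c[j]: share of the users in region j whose home is i."""
    c = region_population(q)
    return {
        i: {j: (q[i][j] / c[j] if c[j] > 0 else 0.0) for j in q}
        for i in q
    }

# Toy joint distribution q[home][remote] over two regions (invented values).
q = {
    "A": {"A": 0.5, "B": 0.1},
    "B": {"A": 0.1, "B": 0.3},
}
c = region_population(q)
s = home_composition(q)
print(c)
print(s)
```

For each current region $j$, the values $s_{ij}$ sum to one over the home regions $i$, since every user in $j$ has exactly one home region.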

Export message pattern
Similar to the UDM, we can extract the distribution of messages diffused to remote regions. We use a hierarchical DP to find this distribution, which is named the export message pattern (EMP). For every region $i$, denote $r_i$ as $\{s_{ji}\}_{j \ne i}$, which follows a hierarchical DP in which $h_i$ denotes the hyper-parameters, $t_i$ and $t_0$ are the corresponding scale parameters and $B_0$ is the base distribution. We then consider the differential export message $d_i$. Given $d_i$, Gibbs sampling can be used to decide what $r_i$ should be.

Self message pattern
The DPM can also explain the distribution of messages diffused inside each region, which is named the self message pattern (SMP). According to equation (7), it is not wise to gather $s_{ii}$, $\forall i \in R$. Instead, we consider $\{s_{0i} = 1 - s_{ii}\}_{i \in R}$ and denote it as $r_0$. We are able to find a scale parameter $t_0$ and a base distribution $I_0$ that generate $r_0$ from a DP. The model can then be solved by Gibbs sampling according to the posterior distribution and the restriction $\sum_j s_{ji} = 1$.

Floating population inference model
Finally, we combine UDM, EMP and SMP as a floating population inference model (FPIM).
The UDM provides the population distribution across regions, whereas EMP and SMP compute the specific allocation inside each region.The model structure is shown in Figure 2.

Evaluation
In this section, we validate the geo-homophily over two real-world OSN data sets that have geo attributes of users and evaluate the performance of proposed UDM and FPIM models.

Data sets
We use data sets of two OSNs: the Gowalla data set and the WeChat Moments data set. The former covers most Western countries, whereas the latter covers mainland China (where Internet censorship is enforced and people have restricted access to popular OSN sites/apps like Facebook). "Gowalla" (Cho et al., 2011) is a typical LBSN where users share their locations by "checking in". The information regarding friend relationships was collected using the public API and consists of 196,591 nodes and 950,327 edges. The edges can be seen as undirected. This Gowalla data set collects a total of 6,442,890 check-ins of these users over the period from February 2009 to October 2010.
"WeChat Moments (WM)" (Schiavenza, 2013) is the social network of a mobile messaging app (WeChat) popular in China, where the contents shared over WM are HTML5 pages (Zhang et al., 2016). This WM data set contains 137,509,889 users with 1,671,692,424 re-tweeting/forwarding records of 329,465 pages from January 14, 2016 to February 27, 2016, telling us when, where and from whom a page is re-tweeted; how many pages a user reads; and whether a user has re-tweeted a page. WM can only be used on mobile devices, and the user location can be inferred from the IP address. The period of the data covers the Spring Festival, a traditional festival in China when most Chinese people migrate back to their home province from their workplace. Note that although the number of users in each data set is much smaller than the population of a country, it is sufficiently large to derive the proportion of the OSN user distribution, as well as the population distribution over geographical regions, which helps us determine how a new OSN user is distributed or how the FP varies across regions.

Geo-homophily of online social networks
As mentioned in Section 3, we can divide users of an OSN into a division of regions, according to users' geo attributes.
6.2.1 Geo-homophily of WeChat
6.2.1.1 Message diffusion in China. The WM data set records the page re-tweeting in 34 provinces in China, and we use these provinces as the geographical regions in this experiment. Every user in WM has viewed a collection of pages, and the IP address of each page view corresponds to a province; the most frequently recorded province is set as the province where the user is located. We analyze the message diffusion process in two time periods:
(1) Before the Spring Festival, we monitor the message diffusion from January 14 to January 31, 2016, which are pre-holiday working and weekend days.
(2) On the Spring Festival day, most people stay at home, and hence the structure of the message diffusion graph would be different.
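The rule for locating a user, i.e. taking the province that appears most often among the user's page views, can be sketched as below. The input format (an iterable of `(user_id, province)` pairs), the sample records and the function name are assumptions for illustration; the actual WM records are not reproduced here.

```python
from collections import Counter

def assign_provinces(page_views):
    """Map each user to the province seen most often among that user's page views.

    page_views: iterable of (user_id, province) pairs, one pair per page view.
    Ties are resolved in favor of the province that was seen first.
    """
    per_user = {}
    for user, province in page_views:
        per_user.setdefault(user, Counter())[province] += 1
    return {user: counts.most_common(1)[0][0] for user, counts in per_user.items()}

# Invented sample records (not taken from the WM data set).
sample_views = [
    ("u1", "Beijing"), ("u1", "Beijing"), ("u1", "Hebei"),
    ("u2", "Guangdong"),
    ("u3", "Sichuan"), ("u3", "Shanghai"), ("u3", "Shanghai"),
]
located = assign_provinces(sample_views)
print(located)
```

A majority rule of this kind is robust to occasional trips, but it assigns long-term migrants to their remote region, which is what the FP inference later relies on.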

Results for pre-holidays. The modularity in the pre-holidays is approximately 0.49. Figure 3(a) shows the volume of message diffusion inside each province and that between […] 5, where the geo-homophily will not decrease when $\Gamma(G, L) \ge$ X L T and vice versa. This implies that the stability will not decrease when the previously mentioned condition is satisfied, which is consistent with Proposition 1.

Performance of user distribution model
Given the order in which users join the OSN and their home regions, we are able to train the UDM. We evaluate the performance of the UDM on the WM and Gowalla data sets, respectively. On the WM data set, we monitor the order in which 30 million users join the OSN and then predict the distribution of the next 10 million users, which is tested over 10 experiment runs (each run contains one million users).
By comparing the ticks over the x-axis and y-axis of Figure 7(c), we observe that FPIM predicts an FP (ticks over the x-axis) lower than that obtained in the national census (ticks over the y-axis). This can be attributed to the fact that a non-negligible proportion of the FP do not view WM pages or may not even use WM. As mentioned earlier, although there exist people not covered by the WM data set, the number of users in the data set is sufficiently large to derive the distributions using the proposed models.
6.5.2 Prediction correctness. Apart from correlation, we are also concerned with densely populated regions that host a large FP, which may cause changes to the online and offline social networks. We use the sets of regions that have the largest proportions of FP to measure the prediction correctness of FPIM.
For every province $r$, FPIM calculates the proportion of FP from two sets, i.e. the set of emigrants, whose currently located region is their remote region but whose home region is $r$, and the set of immigrants, whose currently located region is their remote region $r$ but whose home region is different.
Then, we rank the provinces by the number of immigrants and emigrants and obtain a ranking of provinces on immigrants and a ranking of provinces on emigrants, respectively. Meanwhile, the FP data of the national population census can also produce two rankings of provinces on immigrants and emigrants.
We compare the corresponding rankings obtained from FPIM and the national census and calculate the overlapping rate between two rankings on immigrants or emigrants (defined as the number of regions that appear in both rankings divided by the total number of regions in a ranking) over the top-$y$ provinces according to their normalized proportion values. We vary $y$ from 1 to 10 and plot the histogram in Figure 7(d), which shows that FPIM works satisfactorily, with a match between our prediction results and the data of the national census. The two types of rankings have a high consistency on $\Delta S$. Besides, the correctness on $\Delta S$ is higher than that calculated on the basis of $S$. The performance of FPIM in predicting the set of top provinces on emigrants is better than that in predicting the set of top provinces on immigrants, because FPIM uses the $s_{ij}$'s, which pay more attention to the emigrant proportion.
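The overlapping rate defined above can be computed directly from two ranked lists. The following sketch assumes each ranking is a list of province names sorted from the largest to the smallest proportion; the example rankings are invented placeholders and do not reflect the results in Figure 7(d).

```python
def top_overlap_rate(predicted, reference, y):
    """Share of the top-y regions that appear in both rankings.

    predicted, reference: lists of region names sorted by descending proportion.
    """
    if y <= 0:
        raise ValueError("y must be positive")
    top_predicted = set(predicted[:y])
    top_reference = set(reference[:y])
    return len(top_predicted & top_reference) / y

# Placeholder rankings (hypothetical, for illustration only).
model_ranking = ["P1", "P2", "P3", "P4", "P5"]
census_ranking = ["P2", "P1", "P5", "P6", "P3"]
rates = [top_overlap_rate(model_ranking, census_ranking, y) for y in range(1, 6)]
print(rates)
```

Because the measure compares sets rather than positions, two rankings that contain the same top-$y$ provinces in a different order still score 1.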

Conclusions
In this paper, we propose a systematic study on the population distribution projection over offline geographical regions by analyzing the geographical attributes of OSNs. We propose the concept of geo-homophily in OSNs to establish the correlation between online message diffusion and the stability of geographical regions where a population distribution can be drawn. We formulate the population distribution problem from the perspective of DP and present prediction models that describe how OSN users are distributed into regions and infer the FP across regions. Experiments over large-scale data sets show that online message diffusions can help evaluate the stability of geographical regions, which further facilitates the determination of the population distribution over fixed regions; the proposed prediction models achieve a high prediction accuracy in inferring the change of the FP across regions.

Figure 1. A three-layered analytical framework that defines the geo-homophily of OSNs to map the OSN message diffusion (Layer 1) to the offline user distribution (Layer 2) and infers the FP (Layer 3) based on the derived distribution (Layer 2)

Figure 2. The FPIM has three parts, namely, UDM, EMP and SMP, from left to right
Figure 3. Graph representations of message diffusion inside and across provinces in the WM data set
Figure 4. The viewing distribution of a hot message originating from Beijing
Figure 7. Results of inferring the FP that has excluded those whose home and remote regions are the same