Structural Aspects of User Roles in Information Cascades

Social media plays an important role in the exchange and dissemination of information among its users. In turn, online users shape social media through their interactions, status and online behaviour in general. These aspects differ massively from user to user, which has an impact on the outcome of information diffusion. Users on social media have been categorised into user roles according to their online behaviour. While there has been a lot of research on user roles and on information diffusion in isolation, their combination has received little attention. In this paper, we study their correlation, in particular whether specific user roles occur in specific structural positions in information cascades. By testing several hypotheses, we confirm that there is indeed a correlation between these two aspects. However, some user roles demonstrate diverse behaviour with regard to their activity patterns and need further investigation.


INTRODUCTION
Research on information diffusion studies how a piece of information propagates on social media. Such analysis considers structural aspects, i.e., information propagating from user to user over social graphs, and social aspects, i.e., real users acting in the virtual space of social media. Information diffusion is modeled with information cascades. Information cascades are graphs that reveal how information propagates from user to user, often with the assumption of an underlying social graph. Online users shape information diffusion processes by their interactions and online behaviour. Such behaviour differs from user to user, which has led to the identification of prominent user roles in the literature [23,2].
While research on information diffusion [9,16] and user role identification [23,2] in social media have each received considerable attention, the correlation of these two aspects has not been investigated much. In this paper, we research the correlation of 1) structural aspects of information diffusion and 2) user roles that are derived from online human behaviour. In particular, we seek to identify structural positions of user roles in information cascade graphs. For example: Are celebrities mostly at the root of cascade graphs? Are spammers at the leaves because they do not trigger further reactions? By evaluating such hypotheses, we can shed light on the mechanisms of information diffusion. By specifying the structural positions of user roles, we can better predict the outcome of information diffusion: for example, a particular user role that is observed more often at the leaves might signify the end of information diffusion.
The remainder of the paper is structured as follows: In Section 2 we discuss related work and in Section 3 we describe our dataset. Section 4 provides the methodology and results for reconstructing information cascades, while in Section 5 we identify prominent user roles. Section 6 investigates the correlation of information cascades and user roles by computing structural positions on information cascades and testing several hypotheses. Finally, Section 7 concludes the paper.

RELATED WORK
Information diffusion in social media has been a broad field of research. Models of information diffusion have been developed, e.g. [10,19,8], and information cascades have been researched in many contexts [6,16,26,7,15]. We provide some examples of analysis over information cascades. In [25] the authors investigate the size, shape and decay factors of cascades during the Iranian "Green Revolution" in 2009, while in [14] the shapes and temporal metrics of retweet cascades are evaluated. The authors in [11] investigate human interactions during an emergency event by constructing the corresponding cascades. In [13] the impact of location, time and distance on information adoption is examined, and the list continues to grow.
For identifying prominent user roles we discern two categories: 1) supervised methods like in [23,1] where a framework is used and user roles are adapted to this framework according to the selected features; 2) unsupervised methods like in [4,21] where datasets drive the cluster creation (number and quality of clusters not known in advance) and results need to be interpreted accordingly.
In more detail, the work in [23] develops a model based on the Twitter message exchange processes to identify key players in conversations. This model categorizes Twitter users into specific roles based on their dynamic communication behavior. The work in [1] applies a semantic model combined with statistical analysis to compute behaviour in online forum communities. This model categorises the behaviour of forum community members over time and researches how different behaviours correlate with community growth in these forums. An analysis of user intentions in Twitter was carried out in [12], where the intention of each post was determined manually. The user intentions discovered include: Daily Chatters, Conversations, Sharing Information and Reporting news. For unsupervised methods, the authors in [4] cluster users in Twitter according to their activity. The number and the quality of clusters are not known in advance. Along the same lines, the work of [21] clusters forum users according to their posting behaviour in combination with semantic rules.
While information diffusion and user roles have each been studied in the past, much less is known about the connection between them. The work of [17] identifies communities of users on top of information cascades; in our case, we decouple the features used to detect user roles from the information cascades, since we aim to identify the correlation of both. The authors in [18] correlate diffusion processes with the evolution of the underlying social graph. This problem has been adapted to a probabilistic generative model [5] that allows the understanding and reproduction of such processes. The closest to our work is [24], which studies the interplay between users' social roles and their influence on information diffusion. The model proposed in that work integrates social roles and diffusion modeling into a unified framework. Such a model can be used to predict whether an individual user will repost a specific message at the micro level; at the macro level, the model can predict the scale and the duration of diffusion processes. However, our goal is different since we are trying to identify structural positions of user roles in information cascades.

DATASET
The dataset we are using was recorded during the 2012 Summer Olympics in London using keywords such as "olympics" and "london2012". It contains 13.6 M messages, 2.27 M distinct users and 1.1 M retweet cascades.
In order to obtain reliable results for computing the structural metrics, we filtered out cascades with fewer than five messages. We ended up with 4.618 cascades which is the dataset we are using to compute the metrics presented in Section 4. For identifying prominent user roles we considered 13.56 M users who contributed at least two messages during the 2012 Olympics.

INFORMATION CASCADE RECONSTRUCTION
In this section, we present the methodology to reconstruct information cascades. This will allow us to compute structural metrics on top of information cascades that reveal structural positions (in Section 6). We focus on retweet cascades, but these methods can be applied to other diffusion processes (e.g. hashtags or replies) and different social media.
When users are retweeting, Twitter provides the initial source of a message (root), but not the intermediate forwarders that influenced them. In other words, the intermediate diffusion paths are not provided by Twitter. Under the hypothesis that information flows through social connections (users are exposed to information from their followers), we leverage the social graph to search for possible influencers and unravel the intermediate diffusion paths. We use our algorithm from [22] that reconstructs retweet cascades, given some social graph. This algorithm allows multiple influencers in case more than one of the user's followers are (re)tweeting the same message. As a result, retweet cascades are DAGs with a single root. Note here that in other means of propagation, e.g. hashtags, we might observe multiple roots.
Figure 1 presents the distribution of cascade sizes and diameters. As shown in Figure 1a, cascades have a skewed distribution of size, with the large majority yielding only a few reactions. The largest cascade has 62K messages while 5K cascades have more than 100 messages. Figure 1b shows that cascades tend to be deep, with a mean diameter of 4. Diameters of up to 18 are observed, indicating that information is being propagated to large audiences beyond the root's followers. This has an impact on cascade shapes, which range from star structures to complex structures. Having a diversity of cascade shapes serves our purpose of computing several structural metrics on them well.
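The reconstruction idea described in this section can be illustrated with a minimal sketch. It is not the algorithm from [22]; it only assumes that every earlier poster who is socially connected to a retweeter counts as an influencer. The function name `reconstruct_cascade`, the `exposed_to` mapping and the fallback to the root when no connected poster is found are assumptions made for illustration.

```python
from collections import defaultdict

def reconstruct_cascade(root, retweets, exposed_to):
    """Return edges (influencer -> influenced users) of one retweet cascade.

    root       -- author of the original tweet
    retweets   -- list of (user, timestamp) pairs, one per retweet
    exposed_to -- dict: user -> set of users whose posts reach that user
                  (hypothetical representation of the social graph)
    """
    seen = [root]              # users who already posted the message, in time order
    edges = defaultdict(set)   # influencer -> set of influenced users
    for user, _ts in sorted(retweets, key=lambda r: r[1]):
        # every earlier poster the user is connected to counts as an influencer
        influencers = [u for u in seen if u in exposed_to.get(user, set())]
        if not influencers:
            # assumption of this sketch: fall back to the root reported by Twitter
            influencers = [root]
        for inf in influencers:
            edges[inf].add(user)
        seen.append(user)
    return {u: sorted(vs) for u, vs in edges.items()}

# Toy input (hypothetical): "c" is connected to both "a" and "b",
# "d" is connected to nobody in the cascade.
example = reconstruct_cascade(
    root="a",
    retweets=[("b", 1), ("c", 2), ("d", 3)],
    exposed_to={"b": {"a"}, "c": {"a", "b"}, "d": {"x"}},
)
```

Because edges only ever point from an earlier poster to a later one, the result is acyclic and has the original author as its single root, which matches the DAG property stated above.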

USER ROLE IDENTIFICATION
Next, we identify prominent user roles according to their online behaviour from the dataset in Section 3. We do not rely on any predefined model of influence or any prior knowledge of user groups as in [23]. For such analysis we need 1) features that characterize users and their activity and 2) a clustering algorithm that groups users with similar features together. We considered features that reveal:
• status: number of followers, number of friends, number of times being mentioned, is verified, has url
• activity and engagement: number of tweets, number of retweets, number of replies, number of mentions, number of off-topic messages (messages that do not refer to the crawled dataset according to keywords)
• ability to trigger further reactions: retweet and reply reaction rates (fraction of messages that receive at least one retweet or reply, respectively)
Note that no diffusion-aware features were used to identify user roles. As a result, no correlation between user roles and cascade structure is built in beforehand, unlike in [17].
After extracting features for every user, we need to select a clustering algorithm that will unravel distinct groups of users in the data. We also need to define the number of desired clusters, since this information is needed by most clustering algorithms. We tested the K-Means and Expectation Maximization (EM) clustering algorithms; EM assigns a probability distribution to each user which indicates the probability of belonging to each of the clusters. This is very useful in cases where users fall between more than one cluster or their behaviour deviates over time. In order to assign each user to one cluster, we take the maximum probability for each user to identify the most "fitting" cluster. Both methods require the number of clusters k to be provided in advance.
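The step from soft EM memberships to a single cluster per user can be sketched as follows. The membership values, user ids and function names are hypothetical; the sketch only shows the maximum-probability assignment and a check of how many users are assigned with high confidence.

```python
def hard_assign(memberships):
    """Map each user to the index of the cluster with the highest probability."""
    return {
        user: max(range(len(probs)), key=probs.__getitem__)
        for user, probs in memberships.items()
    }

def share_confident(memberships, threshold=0.9):
    """Fraction of users whose top cluster probability exceeds the threshold."""
    confident = sum(1 for probs in memberships.values() if max(probs) > threshold)
    return confident / len(memberships)

# Hypothetical EM output for three users and three clusters.
memberships = {
    "u1": [0.95, 0.03, 0.02],
    "u2": [0.40, 0.55, 0.05],   # falls between two clusters
    "u3": [0.01, 0.01, 0.98],
}
assignment = hard_assign(memberships)
confident_share = share_confident(memberships)
```

A user such as `u2` is still assigned to one cluster, but the low top probability signals that the assignment is less certain.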
Since we have no a priori information about the number and quality of clusters, we have to define an objective function that shows the best clustering approach and number of clusters for our dataset. Our goal is to a) maximize the cohesion of individual clusters and b) maximize the separation among clusters, so that we end up with well-defined clusters. In practice, the similarity of data items within each cluster, and the dissimilarity of data items among different clusters have to be maximized. The similarity and dissimilarity can be computed by any distance metric like the Euclidean distance.
We used the Silhouette coefficient [20], which accounts for these aspects. It takes values in [−1, 1], and the higher its value, the better the clustering. Following the literature (e.g. [21,4]), we tested numbers of clusters in the range [3,20] for K-Means and EM. The optimal number of clusters for both methods was nine, which was identified by comparing the Silhouette coefficient of the different clusterings over this range.
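For reference, the Silhouette coefficient of a point i is s(i) = (b(i) − a(i)) / max(a(i), b(i)), where a(i) is the mean distance to the other points of its own cluster and b(i) is the smallest mean distance to the points of any other cluster; the score of a clustering is the mean of s(i) over all points. The following is a minimal sketch with Euclidean distance on toy two-dimensional points; the function name and data are illustrative, and points in singleton clusters are given a score of 0 by convention.

```python
import math

def silhouette(points, labels):
    """Mean Silhouette coefficient of a clustering (needs at least two clusters)."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    scores = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)   # convention for singleton clusters
            continue
        # cohesion: the distance of p to itself is 0, so divide by len(own) - 1
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # separation: mean distance to the nearest other cluster
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for key, other in clusters.items()
            if key != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy data: two well-separated pairs of points.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette(points, [0, 0, 1, 1])   # clusters follow the two groups
bad = silhouette(points, [0, 1, 0, 1])    # clusters cut across the groups
```

Selecting the number of clusters then amounts to computing this score for each candidate k and keeping the clustering with the highest value.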
EM yielded the best results for all clusterings within the aforementioned range. The Silhouette was 0.36 (compared to 0.29 for K-Means) and we present the results for EM in the remainder of the paper. The probability distributions produced by EM showed that at least 75% of the users have a probability higher than 0.9 to belong to the first assigned cluster.
We inspected those clusters and interpreted them according to the feature distributions. We observed that five clusters (out of nine) show only minimal differences in the feature distributions and we could not identify any distinct behaviour. As a result, we decided to merge those clusters and assume that they correspond to similar user behaviour. The reason for this is the highly skewed data: most users have very low activity and the majority of messages is contributed by a small fraction of users. In addition, we observe a hierarchical structure within clusters (which also explains the aforementioned results) that shows smaller but noticeable differences among users in the same cluster. The rest of the analysis considers five distinct clusters which are presented in Figure 2.
The five user roles that we identified are:
• Stars: This user role includes extremely popular users (e.g. celebrities, athletes). As seen in Figure 2a, stars have an extremely high number of followers and are selective in whom they follow. The elitists from the literature [3,1] are found in this cluster. They are not as active as users in other clusters, but their messages receive many reactions. They are also mentioned very often, mostly because they are famous. In most cases they are verified and have a url in their profile.
• Information Sources: Users in this cluster are news sources and popular users in particular domains, e.g. bloggers. They have a high number of followers, but the gap between followers and friends is not as extreme as in the case of stars. They are extremely active and engaged, and at the same time they trigger many reactions. They are also more conversational compared to stars, as indicated by the number of replies. They are mentioned less often than stars and they are not always verified (e.g. bloggers recognised in particular domains).
• Daily chatters: These users are the most prolific writers (compared to all clusters), both propagating original information and retweeting. They are not as popular and recognised as the previous clusters. They mainly talk about their daily routines and reproduce information about what is happening around the world.
• Listeners: These accounts contribute rarely, do not receive reactions and have significantly more friends than followers. Note that this cluster is under-represented in this dataset, since only users with at least two messages during the 2012 Olympics are considered, which already excludes the true Twitter listeners.
• Average users: This category falls between daily chatters and listeners. These users are relatively active and receive some reactions. They have a comparable number of followers and friends. Amplifiers [23], who receive information and propagate it further, are also found in this category. Note that this user role includes five merged clusters and contains the majority of users. This means that the dominant cluster of average users has small variations which are not easily interpretable.
We examined representative users and their activity in each cluster to confirm the cluster interpretation. Similar user roles were also identified in the literature [12,4,1,3]. Any differences from the state of the art lie in the different features selected and the different social media platforms evaluated.

STRUCTURAL POSITIONS
After reconstructing information cascades and identifying prominent user roles, we can correlate these two aspects by investigating which positions different user roles occupy in information cascades.
For that, we need to define and compute metrics on information cascades that reveal the structural position of each user. Such metrics reveal the influence exerted by users and their centrality in information cascade graphs. We compute the following metrics for each user:
• (shortest) Distance to the root shows whether particular nodes are roots or close to the root, which means that they are influential or have fast access to information.
• (shortest) Distance to the leaves reveals nodes who do not trigger significant further reactions.
• Closeness centrality measures the distance from a node to all other nodes, which demonstrates how central a node is in the graph.
• Betweenness centrality measures the number of shortest paths that pass through a node and reveals the amount of information flow that a node controls.
• Root influence measures the fraction of nodes who reacted directly to the root and reveals how influential the root is compared to other nodes in the cascade.
• Indegree shows the number of different influencers or the amount of influence a node needs to react to incoming information.
• Outdegree shows how many nodes a particular node influences.
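Several of the metrics listed above can be computed with plain breadth-first search. The sketch below assumes a cascade given as a dict that maps each node to the nodes it influenced; the function names and the toy cascade are illustrative. Root influence is taken here as the fraction of non-root nodes that are direct children of the root, which is one possible reading of the definition above. Closeness and betweenness centrality are omitted for brevity; a graph library could be used for them.

```python
from collections import deque

def bfs_distances(adjacency, sources):
    """Shortest hop distance from any source to every reachable node."""
    dist = {s: 0 for s in sources}
    queue = deque(sources)
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

def structural_positions(children, root):
    """Per-node degree and distance metrics, plus the root influence."""
    nodes = set(children) | {c for cs in children.values() for c in cs}
    parents = {n: [] for n in nodes}
    for parent, cs in children.items():
        for c in cs:
            parents[c].append(parent)
    leaves = [n for n in nodes if not children.get(n)]
    to_root = bfs_distances(children, [root])
    to_leaf = bfs_distances(parents, leaves)   # walk edges backwards from all leaves
    metrics = {
        n: {
            "indegree": len(parents[n]),
            "outdegree": len(children.get(n, ())),
            "dist_to_root": to_root.get(n),
            "dist_to_leaf": to_leaf.get(n),
        }
        for n in nodes
    }
    # assumption: fraction of non-root nodes that reacted directly to the root
    root_influence = len(set(children.get(root, ()))) / max(len(nodes) - 1, 1)
    return metrics, root_influence

# Toy cascade: "a" is the root, "c" has two influencers.
cascade = {"a": ["b", "c", "d"], "b": ["c"]}
metrics, root_influence = structural_positions(cascade, "a")
```

In the toy cascade, node "c" has indegree 2 because both "a" and "b" count as its influencers, and every non-root node reacted directly to the root.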
Next, we compute the distributions of these metrics for the different user clusters that were identified in Section 5. We assume that different user roles will demonstrate considerable differences in terms of their influence and centrality in information cascade graphs.
In order to model behaviours for separate user roles, we associate each of the metrics with intensity levels (low, medium, high) according to the range of their distribution. We split the observations of every metric (for each user role) into three equal-sized quantiles (0-33.3% for low, 33.3-66.6% for medium and 66.6-100% for high): this facilitates the comparison of different user roles with regard to these metrics. A similar approach was followed by [21]. By doing this we can answer questions like: Do stars have a high outdegree compared to the other user roles?
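The split into intensity levels can be sketched as a rank-based binning into three equal-sized groups. The function name, the sample outdegree values and the handling of ties (broken by input order) are assumptions of this sketch.

```python
def tercile_levels(values):
    """Label each observation low / medium / high by rank-based terciles."""
    n = len(values)
    # indices of the observations sorted by value; ties keep their input order
    order = sorted(range(n), key=lambda i: values[i])
    labels = [None] * n
    for rank, i in enumerate(order):
        if rank < n / 3:
            labels[i] = "low"
        elif rank < 2 * n / 3:
            labels[i] = "medium"
        else:
            labels[i] = "high"
    return labels

# Hypothetical outdegree observations for six users.
outdegrees = [0, 12, 3, 1, 40, 7]
levels = tercile_levels(outdegrees)
```

Comparing how often a user role falls into each level then gives a simple way to phrase hypotheses such as "stars have a high outdegree".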
We form similar questions/ hypotheses that are going to be confirmed or rejected according to evidence from the data.
The hypotheses that were successfully confirmed include:
• Stars are creating original content and they demonstrate overall low indegree.
• Stars are influencing a lot of others and they have high outdegree.
• Since stars are influencing many others directly, they should also be "close" to them in the graph and demonstrate high closeness centrality.
• Stars are often observed at the cascade root.
• When stars are at the root, they have high root influence; their friends are reacting mostly because they are famous.
• Daily Chatters and Listeners are positioned at the leaves because they fail to trigger further reactions.
• Daily Chatters and Listeners are not central in the cascade graph and have low betweenness centrality.
We did not collect enough evidence to position daily chatters and listeners in the periphery of the graph, i.e. to show that they have low closeness centrality. In reality, daily chatters are often influenced directly by the root because of their fast reactions, which also brings them "closer" to other nodes. For information sources we failed to confirm any hypotheses, since this user role either acts as root or is found within diffusion paths. Given their diverse behaviour of being popular but at the same time being engaged with others, they can occupy multiple positions in information cascades.
We also failed to confirm any hypothesis for the average user: these users can take multiple positions on the cascades either in the middle as amplifiers or at the leaves.
The aforementioned hypotheses were tested with the non-parametric Mann-Whitney-Wilcoxon test. The hypotheses that were confirmed were statistically significant at the 0.001 level. We tested the differences in means for the users in each user role versus the full user population and the differences in means according to the three quantiles (low, medium, high).
In general, we can observe that user roles at the ends of the spectrum (stars, listeners and daily chatters) are correlated with cascade structure. The user roles of information sources and average users need further investigation in terms of their behaviour, since these users occupy multiple positions in the cascades. We also need further evaluations to understand the subtle differences between daily chatters and listeners in the information cascades. These two user roles seem to have very different behavioural patterns but they occupy similar structural positions. In order to validate the importance of such analysis, we will further evaluate these hypotheses on larger datasets and more social media platforms.
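As an illustration of this kind of test, the sketch below implements a two-sided Mann-Whitney U test with mid-ranks for ties and a normal approximation without continuity correction. Whether the original analysis used one-sided or two-sided tests is not stated, so the two-sided form is an assumption, and the sample values are hypothetical.

```python
import math

def mann_whitney_u(x, y):
    """Return (U statistic for x, approximate two-sided p-value)."""
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    n1, n2 = len(x), len(y)
    n = n1 + n2
    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        t = j - i                       # size of this group of tied values
        avg_rank = (i + 1 + j) / 2      # mean of the 1-based ranks i+1 .. j
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        tie_term += t ** 3 - t
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u, 1.0
    z = (u - mean_u) / math.sqrt(var_u)
    p = math.erfc(abs(z) / math.sqrt(2))   # two-sided tail of the standard normal
    return u, p

# Hypothetical outdegree samples: one user role versus the remaining users.
role_sample = list(range(100, 130))
others_sample = list(range(0, 30))
u_stat, p_value = mann_whitney_u(role_sample, others_sample)
u_same, p_same = mann_whitney_u([1, 2, 3], [1, 2, 3])
```

A hypothesis would count as confirmed in the sense used above when the p-value falls below 0.001; for small samples an exact test is preferable to the normal approximation.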

CONCLUSION AND FUTURE WORK
In this paper, we presented a study that correlates user roles with structural aspects of information diffusion in Twitter. While we identified particular user roles that correlate with the cascade structure, this work has some limitations. For the user roles that constitute the core of social media (average users and information sources) we failed to confirm any hypotheses and we need to investigate their online behaviour further. For future work, we plan to identify information cascade shapes (stars, complex structures, long paths, etc.) and correlate such shapes with user roles. This analysis will help us to gain a better understanding of human interactions and influence in social media and provide valuable insights for information diffusion processes.