c ⃝2014 IEEE. This is the submitted version of the paper (with some erroneous experimental results corrected, which are designated in red). The final revised version is available at: http://dx.doi.org/10.1109/WI-IAT.2014.45 Online Retweet Recommendation with Item Count Limits Xiaoqi Zhao Keishi Tajima Graduate School of Informatics Kyoto University Yoshida-Honmachi, Sakyo, Kyoto 606-8501 Japan Email: [email protected] Graduate School of Informatics Kyoto University Yoshida-Honmachi, Sakyo, Kyoto 606-8501 Japan Email: [email protected] Abstract—Some Twitter accounts provide information to the followers not by publishing their own tweets but by forwarding (i.e., retweeting) useful information from their friends. These accounts need to select an appropriate number of tweets that match the followers’ interests. If they retweet too many or too few tweets, it annoys the followers or degrade the value of the accounts. They also need to retweet them in a timely manner. If they retweet a tweet long after they receive it, the informational value of the tweet may diminish before the followers read it. There is, however, a trade-off between these two requirements. If they select tweets after seeing all the candidates, they can select the best given number of tweets, but in order to provide timely information, they have to decide to (or not to) retweet each tweet before seeing all the following candidates. In order to help the management of such Twitter accounts, we developed a system that reads a sequence of tweets from the friends one by one, and select a given number of (or less) tweets in an online (or near-online) fashion. In this paper, we propose four algorithms for it. Two of them give priority to the timeliness, and make a decision immediately after reading a new tweet by comparing its score with a threshold. The other two give priority to the selection quality, and make a decision after seeing some following tweets: after seeing the following tweets belonging to the same time subinterval or after seeing a fixed number of following tweets. The former two are truly online algorithms and the latter two are near-online algorithms. Our experiment shows that the near-online algorithms achieve high selection quality only with acceptable time delays. I. I NTRODUCTION Recently, microblogging services are rapidly growing. Twitter, which is the most popular microblogging service, has over 115 million monthly active registered users, sending 58 million messages (called tweets) per day [1]. In Twitter, a user follows other users so that all the tweets by them are shown on the user’s Twitter interface, called timeline. Those who follow a user are called followers of the user, and those whom a user follows are called friends of the user. Twitter is not only playing a role of communication media, but also used as a medium for collecting useful real-time information. About 40% of active Twitter users do not submit tweets but just watch other people’s tweets [2]. For those users, Twitter is a tool for real-time journalism [3]. For those users, however, it is not easy to find useful information buried in the huge number of tweets. To help those users to collect useful information on Twitter, accounts that play the role of “portal sites” on Twitter have appeared. These accounts provide their followers with useful information mainly by retweeting (i.e., forwarding) tweets from their friends, rather than tweeting their own original tweets. In this paper, we call those accounts portal accounts. The daily task of the administrators of such portal accounts is to read all tweets from the friends, and select tweets to retweet that best match the followers’ interests. They need to select an appropriate number of tweets in each time interval, e.g., in every 24 hours. If they retweet too many tweets, it annoys the followers and make them stop following the accounts. If they retweet too few tweets, it also degrades the value of the accounts as a portal, and they lose the followers. In addition, they need to retweet tweets in a timely manner. According to [4], the life cycle of a tweet is only 48 hours. If they retweet a tweet long after receiving it, the informational value of the tweet may diminish before the followers read it. Even when the value of a tweet lasts longer, if the followers receive the same tweet from other sources before receiving from the portal account, it degrades the value of the portal. Even when these are not the case, the followers want to receive fresh information. Making the followers feel they are “first to know” the information will add value to the portal account [5]. For these reasons, it is important to reduce the time lag between arrival and retweeting of tweets. As explained above, portal accounts on Twitter need to satisfy two requirements: selection of an appropriate number of good tweets and prompt forwarding of them. There is, however, a trade-off between these two requirements. If they select tweets after seeing all the candidates in some time interval, e.g. at the end of a day, they can select the best given number of tweets. In order to provide timely information, however, they have to make a decision for each tweet before seeing all the following candidates. Once we retweet a tweet, we cannot withdraw it later even if we find many following tweets are much better than the previous one. We can delete it from Twitter, but some followers may have already read it. These mean that this is a typical environment where we need some online algorithm. In addition, manually selecting all retweets is a hard task for the administrators of portal accounts. Portal accounts usually have many friends and many followers. The administrators have to monitor a huge number of tweets from the friends for 24 hours a day. They also have to analyze the interests of many followers in order to select tweets best matching their interests. It would be better for the administrators if these procedures are automatated. In this paper, we propose a system that automates these tweet evaluation and selection processes. Our system reads a sequence of tweets from the friends one by one, evaluate how much each tweet match the followers’ interests, and select a given number of tweets (or less if there are not enough tweets). We evaluate a tweet by comparing its feature vector with the feature vector produced from the profiles and tweets of the followers. The selection is done in an online or nearonline fashion. When a new tweet arrives, the system decides to retweet it or not immediately, or after seeing some (but not all) following tweets. This system automatically retweets an appropriate number of good tweets within short delays. For the selection process, we propose and compare four algorithms in this paper. The basic strategies of the four algorithms are summarized as below: • timeliness-oriented approaches (online) 1) history-based threshold estimation 2) stochastic threshold estimation • selection-quality-oriented approaches (near-online) 3) selection at every time interval 4) selection at every n items The former two algorithms give priority to the timeliness, and make a decision immediately after reading a new tweet. These two algorithms use threshold to make a decision. The first algorithm computes the threshold by taking the average of offline-computed best thresholds for the past time intervals. The second algorithm computes the threshold based on the estimated probability distribution of the values of tweets. The other two give priority to the selection quality, and permits some delay of retweeting. They make a decision after seeing some (but not all) following tweets. In the third algorithm, the system divides time into subintervals, and periodically select tweets at the end of each subinterval from the tweets arrived within that subinterval. In the fourth algorithm, the system periodically selects tweets every time it receives n tweets, where n is computed based on the number of recently received tweets. The former two are truly online algorithms and the latter two are near-online algorithms. In order to create a dataset for the evaluation of these four algorithms, we selected some real portal accounts on Twitter, and collected the tweets by their friends. In order to infer the interests of the followers of the portal accounts, we collected the profiles of the followers, and also collected tweets by the followers. Our experiments on this data set shows that the two near-online algorithms achieve high selection quality with only short acceptable time delays in retweeting. The remainder of the paper is organized as follows. In Section II, we review some related research. In Section III, we formally define our problem. In Section IV, we explain our method of evaluating how much a tweet match followers’ interests. In Section V, we describe our four algorithms for tweet selection. In Section VI, we evaluate our four algorithms by experiments. Finally, we conclude the paper and address future research directions in Section VII. II. R ELATED W ORK In this section, we review related work on retweet recommendation, and also some theoretical work on problems related to our problem. A. Retweet Recommendation There have been much research on Twitter, such as research on user classification, hot topic detection, recommendation of tweets to read, and friend recommendation. On the other hand, there have been only a little research on retweet recommendation, i.e., recommendation of tweets that the target user may want to share with the followers. The survey paper by Kywe et al. [6] explained that there was no paper about personalized retweet recommendation when the paper was written. The authors, however, also pointed out that the work by Uysal et al. [7] and the work by Nasirifard et al. [8] can be applied to retweet recommendation as explained below. Uysal et al. [7] developed a classifier to predict the probability that a given user retweets a given tweet. They model the user’s past retweets by using four types of features: authorbased, tweet-based, content-based and user-based features. This classifier can be used for retweet recommendation. Nasirifard et al. [8] developed a system that helps various activities on Twitter, and it can determine whether the hashtags appearing in a tweet are relevant to the user’s followers. This function can also be used for retweet recommendation. To the best of our knowledge, the only existing study on personalized retweet recommendation is the study by Wang et al. [9]. Their method ranks tweets by using the information on the past retweeting activities of the target user. The ranking score of a tweet is computed from three kinds of information: (1) how often the target user has retweeted tweets from the sender of the tweet, (2) which tweets the user retweeted or not retweeted in the past, and (3) which retweets by the user in the past had a big impact on the followers. There have been also studies on the analysis of users’ retweeting behavior and prediction of future retweets [10], [11], [12], [13], [14], [15], [16]. Their primary purpose is not recommendation of tweets to retweet, but these results can also be used for retweet recommendation. For example, Macskassy et al. [12] proposed four models of users’ retweeting behavior. The first model simply assumes that a user retweets recent tweets with higher probability than older ones. The other models add some factor to this base model. The second model adds a factor that gives higher probability to tweets from users with whom the target user recently interacted through retweeting or direct messaging. The third model adds a factor that gives higher probability to tweets whose topic is similar to the target user’s topic of interest. The fourth model adds a factor that gives higher probability to tweets from users whose interests are similar to that of the target user. The result of their experiment with real Twitter users shows that a combination of more than one model can better explain the behavior of any user, but it also shows that the third model based on the similarity between tweets and user interests works best in average. On the other hand, Peng et al. [13] used CRFs (Conditional Random Fields) to model the behavior of retweeting users. The model includes three types of features: features defined for each tweet, features defined for each combination of a tweet and a user, and also features defined for each combination of a tweet and two users (e.g., the author of the tweet and a user retweeting it). Some features are related to temporal factors. Their experimental results show that the features defined for a tweet and a user are the most predictive indicators. Both [12] and [13] conclude that the topic similarity between tweets and user interests is the most important. Following their results, we also use similarity between tweets and user interests for tweet evaluation. [12] and [13], however, construct a model based on users’ retweets in the past. It means that their models mainly predict which tweets the users want to retweet. On the other hand, our purpose is to help management of portal accounts by recommending tweets that the followers wants to read. Because of this, we use similarity between tweets and the interests of the followers, instead of similarity between tweets and the interests of the target users. Vu et al. has proposed a system that aggregates information produced by group members on various online services into one social stream. The system filters out irrelevant or private information from the stream, and provide only relevant information to all the group members for information sharing [17]. This idea is similar to the idea of portal accounts on Twitter. All the studies above are related to retweet recommendation, but none of them discussed the problem of selecting appropriate number of tweets in an online or near-online fashion, which is the main concern of this paper. There have been also much research on recommendation of tweets not for retweeting but for the target users’ own reading. When we recommend tweets for the target user her/himself, we do not need to care about the number of tweets we recommend. We simply show the ranked list of recommended tweets to the user. Then the user reads the list starting from the top-ranked tweet, and stop when the user runs out of time. Therefore, there have been no research on tweet recommendation discussing the problem of the number of tweets to recommend. candidates one by one, and after each interview, you have to decide to hire the candidate or not immediately. The goal is to maximize the total score of the hired candidates. Another related problem is an online knapsack problem [19], [20]. In an online knapsack problem, items with a value and a weight are shown to you one by one, and after seeing each item, you have to decide whether you keep the item in your knapsack or not immediately. You need to maximize the total values of the selected items while keeping the total weights within the given capacity. The multiple-choice secretary problem is a special case of the online knapsack problem where the weights of all items are equal. There are variations of the online knapsack problem. For example, the total number of items may or may not be given in advance, and the distribution of item values may or may not be given. In our case, both are not given. Therefore, in order to apply some existing algorithm that requires either of them, we first need to estimate it. III. P ROBLEM D EFINITION In this section, we give the formal definition of our problem. Let u be the current target portal account. V (u) = u } denotes the followers of u, and D(u, T ) = {v1u , . . . , vm u u ⟨d1 , . . . , dn ⟩ denotes the sequence of tweets posted by the friends of u during a time interval T = [ts , te ], where du1 is the oldest tweet. t(dui ) denotes the posting time of dui and s(dui , u) denotes the score of dui for u. Note that t(du1 ) < t(du2 ) < · · · < t(dun ). The details of how to compute s(dui , u) is explained in the next section. Our recommendation system chooses a sub-sequence of D(u, T ) including at most c tweets. Let D′ (u, T ) = ⟨duc1 , duc2 , . ∑ . .⟩ denotes the selected sub-sequence. Our goal is to maximize d∈D′ (u,T ) s(d, u) while keeping |D′ (u, T )| ≤ c. IV. T WEET E VALUATION P ROCESS Our recommendation process consists of two sub-processes: whenever a new tweet arrives, we first compute the score of the tweet (evaluation process), and we decide to or not to select it (selection process). In this paper, however, we focus on the latter, i.e., the selection process, for the sake of space limitation. In this section, we explain how we compute the score of tweets only briefly. As explained in Section II, we compute it based on the topic similarity between the tweet and the followers’ interests. B. Secretary Problems and Online Knapsack Problems As we explained in I, the recommendation for a portal account should satisfy the following requirements. 1) Maximize the match between selected tweets and users’ interests. 2) Recommend at most a given number of tweets in each time period. 3) Decide to recommend a tweet or not as early as possible. Such a problem is known as a multiple-choice secretary problem [18]. In a multiple-choice secretary problem, you want to hire a given number of secretaries, you interview A. Follower Filtering Followers of a target user, however, may include many inactive users. In retweet recommendation, we should eliminate inactive users, and focus only on the interests of active followers. Login timestamps are the best clue for distinguishing active users from inactive users, but it is not publicly accessible through Twitter API. In this paper, we regard an account is inactive if it satisfies both of the following conditions: • it posts less than one tweet per week, and • the number of its friends has not changed for 60 days. B. Estimation of Followers’ Interests C. Evaluation of Tweets Next, we estimate the interests of the active followers. Some social networks services, e.g., Facebook, have metadata explicitly describing categories each user is interested in (although some users may not provide information). Twitter, however, only has a free-format profile data, and many users do not provide detailed information in their profiles. In order to obtain as much information on the followers’ interests as possible, we use text in their profiles, and also use text in tweets posted by them. There are various ways to compute a vector representing the topic of the tweets or the profiles. In this paper, however, our main concern is not the evaluation process but the comparison of four algorithms for the selection process. We, therefore, use the most basic tf-idf method instead of other approaches, such as LDA (Latent Dirichlet Allocation). Before any processing, we preprocess all text contents in our dataset by word stemming, stop word elimination, and unicode normalization. We also eliminate all non-ascii characters, such as Cyrill characters and Greek characters. After that, for each follower v of u, we first produce a standard tf-idf vector from the tweets by v in the following way. For each combination of a word w and a follower v, we calculate its tf value tf (w, v), and for each word w, we also calculate idf (w), idf of the word w in the Twitter world, by the formula below: In the same way as we produced the vector ⃗v (u) for the followers, we produce a vector d⃗ui for each incoming tweet dui in D(u, T ). We apply the same preprocessing to the text contents of dui , extract words used in the vector ⃗v (u) from dui , create a tf vector (not tf-idf vector), and normalize it so that its norm is 1. More formally, d⃗ui is defined as follows: tf (w, v) = idf (w, u) = occur (w, v) ′ w′ occur (w , v) |{vi |vi ∈ V (u), tf (w, vi ) > 0}|/|V (u)| log |{vi |vi ∈ U, tf (w, vi ) > 0}|/|U | ∑ where occur (w, v) is the total number of occurrences of the word w in all tweets by v, and U is a set of randomly sampled Twitter users. By using tf (w, v) ∗ idf (w, u), we create a tf-idf vector for each follower v. In our implementation, however, for efficiency, we do not use all words appearing in tweets by the followers of u. For each ∑ u, we select 5,000 words that have larger values for ( v∈V (u) tf (w, v)) ∗ idf (w, u), i.e., the sum of the tf-idf values of w over all the followers of u. Next, we augment the vector above by using the information from the profile of the followers. We extract nouns from the profiles of all the followers of u by using morphological analyzer Mecab1 . We add dimensions for these words to the vector above (except when a word is already included in the 5,000 words above). For each of those nouns, if it appears in the profile of a follower v, we set the coordinate for the noun in the tf-idf vector for v to the average of tf-idf values within that vector. Finally, we compute a vector ⃗v (u) representing the interests of all the followers of u by taking the sum of the vectors of all the followers of u, and normalizing it so that its norm is 1. 1 Mecab, https://code.google.com/p/mecab/ d⃗ui = (occur(w1 , dui ), occur(w2 , dui ), . . .) √ occur(w1 , dui )2 + occur(w2 , dui )2 + · · · where w1 , w2 , . . . are the words used when producing ⃗v (u), and occur(wj , dui ) is the number of occurrences of the word wj in dui . We compute the score of dui for u, i.e., s(dui , u), by taking cosine similarity between ⃗v (u) and d⃗ui as defined below: s(dui , u) = ⃗v (u) · d⃗ui . V. S ELECTION P ROCESS In this section, we explain our four algorithms for selecting tweets. As explained in Section I, two algorithms are online algorithms that uses thresholds, and the other two algorithms are near-online algorithms that make decisions periodically. Our two online algorithms have the same basic structure. We compute a threshold, and whenever a tweet arrives, if its value exceeds the threshold, we retweet it. The two algorithms differ only in the method of deciding the threshold. The input of the algorithms is c, the upper limit of the number of tweets, and ⟨du1 , . . . , dun ⟩, the sequence of tweets posted by the friends during some time interval. The output is a sequence of selected tweets. A. History-Based Threshold Algorithm In the first method, we use the results of the past time intervals to define the threshold. Let Tz be the current time interval. We record the scores of tweets in the past k time intervals, i.e., Tz−k , . . . , Tz−1 , and for each Ti (z − k ≤ i ≤ z − 1), we compute the c-th largest score among those of the tweets in Ti . This value corresponds to the best threshold value for the time interval Ti , computed offline after we obtain all tweets in Ti . Let θ̂i denote such an “best offline threshold” for Ti . We compute θ̂z−k , . . . , θ̂z−1 , i.e., the best offline thresholds for the past k intervals, and use their average as the threshold for the next time interval Tz . Algorithm 1 shows this algorithm, which we call history-based threshold algorithm. This algorithm is based on the assumption that the optimal thresholds for consecutive time intervals do not largely change. B. Stochastic Threshold Algorithm In our second algorithm, we estimate the threshold in a probabilistic framework. In our definition, s(dui , u) is the similarity between the vector for a tweet and the vector for the followers. The latter, i.e., the vector for the followers, is fixed throughout the time interval. We also assume that the former, i.e., the vectors for tweets, are independent from each other. Therefore, we Algorithm 1 History-Based Threshold Algorithm i := 1 O := ∅ z−1 ∑ θ := θ̂i /k i=z−k while i ≤ n and |O| < c do if s(dui , u) ≥ θ then retweet dui O := O ∪ {dui } end if i := i + 1 end while assume s(dui , u) of tweets in D(u, T ) are i.i.d. (independent and identically distributed). To model s(dui , u), we use the beta distribution for the following reasons. 1) The score s(dui , u) is in the range [0, 1). Although s(dui , u) is defined as cosine similarity which is in the range [0, 1], s(dui , u) can never be 1 because the number of words included in a single tweet is less than 140, while the dimension of our vectors is larger than 5,000. Therefore, we should use some univariate probability distribution whose domain is [0, 1). 2) According to our survey, scores of tweets from the friends of a portal account u are not usually uniformly distributed, so we should not use the uniform distribution. The probability density function of the beta distribution is defined as follows: f (x; α, β) = B(α, β)−1 xα−1 (1 − x)β−1 where B(α, β) is a Beta function, which is defined by using Gamma function as follows: Γ(α)Γ(β) B(α, β) = . Γ(α + β) We estimate parameters α, β from the scores of the tweets that has arrived so far by using maximum likelihood estimation. Here we make the following assumptions: 1) the beta distribution can fit the scores of the tweets in the past, 2) scores of tweets in the current time interval are randomly sampled from the estimated distribution, 3) the upper bound of the scores of the tweets in the current interval T , denoted by smax (T ), is known, and 4) the number of incoming tweets in the current interval, i.e., |D(u, T )| can be estimated. Then we obtain: ∫ smax (T ) f (x; α, β)dx c θ = (1) µ |D(u, T )| where θ is the threshold we want to compute, and µ is the mean value of f (x; α, β). Let It (α, β) be the cumulative distribution function of the beta distribution. Then (1) can be written as: Ismax (T ) (α, β) − Iθ (α, β) c = µ |D(u, T )| and then we have θ = I −1 (Ismax (T ) (α, β) − µc ). |D(u, T )| (2) We can estimate the threshold θ by (2) only if we can estimate smax (T ) and |D(u, T )|. We estimate them in the following ways: 1) we estimate smax (T ) by the average of maximum scores in the last k time intervals, and 2) we estimate |D(u, T )| by the average of the number of incoming tweets in the last k time intervals. However, the assumptions we explained before do not necessarily hold, and the threshold computed by (2) includes errors for the following reasons. 1) Scores during the time interval T are biased sampling from the distribution f (x; α, β) which does not exactly fit the score distribution. 2) smax (T ) varies from time to time. 3) |D(u, T )| also varies from time to time. In order to reduce the error between θ computed by (2) for the current interval Tz and the real best threshold for Tz , we use the errors between them in the past intervals. We compute θz−k , . . . , θz−1 , i.e., thresholds estimated by (2) for the past k intervals Tz−k , . . . , Tz−1 , and also compute the best offline thresholds θ̂z−k , . . . , θ̂z−1 defined in the subsection V-A. Then we compute the average error in the past k time intervals by the formula below: ϵ= z−1 ∑ (θi − θ̂i )/k. i=z−k By using this ϵ, we define a revised threshold θ′ as below: θ′ = θ − ϵ. Our second algorithm use this θ′ as the threshold. We omit the formal definition of the algorithm because it is the same as the first algorithm except that we use θ′ as the threshold. C. Time-Interval Algorithm The remaining two algorithms select tweets periodically. The third algorithm does it in a fixed time interval. We divide the current time interval T into c sub-intervals. Notice that c is the number of tweets we select. In each subinterval, we select one tweet, and thus we select c tweets in total. At the beginning of each sub-interval, we create an empty buffer for storing tweets. During a sub-interval, we store all incoming tweets in that buffer, and at the end of the subinterval, we select a tweet that has the highest score among the tweets in the buffer. When c is large compared with the number of incoming tweets, or when the arrival of tweets are not uniformly Algorithm 2 Time-Interval Algorithm i := 1 O := ∅ B := ∅ subinterval := 1 while i ≤ n and |O| < c do if just passed the end of a sub-interval then k := subinterval − |O| retweet top-k tweets in B O := O ∪ {top k tweets in B} B := ∅ subinterval := subinterval + 1 end if B := B ∪ {dui } i := i + 1 end while distributed but are skewed to specific time period, we may have sub-intervals during which we have no incoming tweet. In such a case, we “carry over” the slot to the next sub-interval, and we select two tweets in the next sub-interval. If we have less than two tweets in the next sub-interval, we “carry over” the unused slots to the following sub-interval again. In the implementation, we do not need to store all incoming tweets, but need to store only n tweets that have the highest scores among the tweets that have arrived so far in the current sub-interval, where n is the number of “carry-over” plus 1. In the definition below, however, we simply store all the incoming tweets for simplicity of the definition. Formal definition of the third algorithm, which we call timeinterval algorithm, is shown in Algorithm 2. D. Every n-Tweets Algorithm The last algorithm also periodically makes a decision, but not at every fixed time interval. Instead, the fourth algorithm makes a decision at every fixed number of tweet arrivals. First, we estimate the number of incoming tweets in the current time interval |D(u, T )| by using the average number of tweets in the last k time intervals. Then we calculate n = ⌊|D(u, T )|/c⌋, and every time n tweets arrive, we select a tweet with the highest score among them. Similarly to the time-interval algorithm, we need a buffer to store tweets. In the fourth algorithm, however, we do not have “carry over”. Therefore, we only need to store at most one tweet. However, in the definition of the algorithm in Algorithm 3, we store all n tweets for simplicity of the definition. VI. E XPERIMENT A. Dataset Creation In this research, we focus on the recommendation of tweets for a portal account to retweet. In order to evaluate our four algorithms, we chose real portal accounts on Twitter, and created a dataset for simulating each of them. We first selected 20 portal accounts satisfying the following conditions: Algorithm 3 Every n-Tweets Algorithm i := 1 O := ∅ B := ∅ while i ≤ n and |O| < c do B := B ∪ {dui } if |B| ≥ n then retweet top tweet in B O := O ∪ {top tweet in B} B := ∅ end if i := i + 1 end while 1) It has more than 300 followers. 2) The account has been used for more than 2 years. 3) At least tweet and retweet 20 tweets a week. 4) More than 50% of its tweets are retweeted tweets. Then we selected 7 accounts from them. In order to preserve diversity, we classify these 20 portal accounts into 7 categories, and selected the most typical portal accounts from each of the 7 categories. The names of the selected 7 accounts are shown in Table I. For each of these portal accounts, we collected the tweets by their friends for more than two months (from Oct. 1, 2013 to Dec. 24, 2013) in order to simulate the activity of the portal accounts. In order to estimate the interests of their followers, we also collected the profiles of their followers, and collected tweets by the followers for more than two months (from Oct. 1, 2013 to Dec. 24, 2013). Because of the rate limit of Twitter REST API2 , we could not collect all the tweets by the followers for portal sites with a large number of followers, e.g., Microsoft Japan PR and Tokyo Government PR, which have 19776 and 65768 followers, respectively. In such cases, we randomly select at least 1,500 followers, and collect tweets by them. On the other hand, we did not apply sampling for the friends, even when a portal account has many friends. The number of the followers and the friends of each portal account are also summarized in Table I. Table I also shows the number of followers for which we actually collected tweets. Although we chose 1,500 users from the followers of each portal account, there was a large number of followers that are shared by more than one portal account, and as the result, the number of monitored followers is far larger than 1,500 for most accounts. Our dataset include 5,649 users in total and more than 180 million tweets. 5,649 is far smaller than the sum of the monitored followers of the seven portal accounts. It is because there were large duplication among them as explained above. B. Experimental Results By using the dataset explained above, we compared the performance of proposed four algorithms. A standard metric 2 REST 1.1 API Rate Limiting in v1.1 https://dev.twitter.com/docs/rate-limiting/ TABLE I S TATISTICS OF 7 P ORTAL S ITES U SED IN O UR E XPERIMENTS No. 1 2 3 4 5 6 7 User Name (English Translation) Microsoft Japan PR U. Tokyo Navi Tokyo Government PR Nara-city Official Twitter Kagawa Prefecture PR Kyoto U. COOP Travel Bureau Tokyo Parc Association PR Screen Name mskkpr Todai Navi tocho koho naracity tweets PrefKagawa travel id TokyoParks # of Followers 19776 4743 65768 1718 5854 330 1667 TABLE II T HE AVERAGE COMPETITIVE RATIO r(u, T ) OF F OUR A LGORITHMS WITH c = 6, 12, 24 c=6 8.54% 4.75% 68.47% 70.56% c = 12 19.35% 8.79% 65.22% 69.94% c = 24 31.69% 13.34% 63.00% 71.76% for the evaluation of online algorithms is competitive ratio, which is the ratio between the performance of an online algorithm and the performance of the optimal offline algorithm. We compute competitive ratio for a portal u at a time interval T by the formula below: ∑ d∈D ′ (u,T ) s(d, u) r(u, T ) = ∑ d∈D̂ ′ (u,T ) s(d, u) where D′ (u, T ) is the the sub-sequence of tweets selected by some online algorithm, while D̂′ (u, T ) is the best answer computed by an offline algorithm. In other words, D̂′ (u, T ) is the top-c tweets posted by the friends of u during T . r(u, T ) represents the ratio of total value of tweets selected by an online algorithm and by the optimal offline algorithm. We run our four algorithms for the 7 portal sites for 75 days with c = 6, 12, 24 and the length of the interval T is 24 hours. In other words, we need to recommend 6, 12, or 24 tweets every 24 hours. The average values of r(u, T ) over 75 days for each algorithm and for each of c = 6, 12, 24 are summarized in Table II and Figure 1. As shown in the table and the graph, the performance of two online algorithms are low while the performance of two near-online algorithms are high. The performance of Stochastic Threshold algorithm is always the lowest, and the performance of Every n-Tweets algorithm is always the highest. We omit the detailed results in each of 7 ∗ 3 = 21 cases for 7 sites and c = 6, 12, 24 for the sake of space limitation, but n-Tweets algorithm shows the highest performance for 15 cases among the 21 cases, and for the other 6 cases, Time-Interval algorithm shows the highest performance. Their differences are, however, not large in all these 21 cases. Time-Interval algorithm and Every n-Tweets algorithm, however, have one serious disadvantage: time delay of retweeting. We calculated the average delay of retweeting for each algorithm with c = 6, 12, 24. The results are summarized in # of Friends 57 29 153 28 30 145 57 corrected avg(r(u, T )) Average of r(u, T ) History-Based Threshold Stochastic Threshold Time-Interval Every n-Tweets # of Monitored Followers 3091 2415 2497 1718 1749 330 1667 History-Based Stochastic Time-Interval Every n-Tweets c=6 c = 12 c = 24 Fig. 1. Average competitive ratio r(u, T ) of the algorithms with c = 6, 12, 24 TABLE III AVERAGE D ELAY OF F OUR A LGORITHMS WITH c = 6, 12, 24 Delay (sec) History-Based Threshold Stochastic Threshold Time-Interval Every n-Tweets c=6 0 0 7051 6084 c = 12 0 0 3609 2716 c = 24 0 0 1946 1198 Table III and Figure 2. The delay is almost 0 for two online algorithms. The delay for Time-Interval algorithm for c = 6 is about two hours. This is because we retweet one tweet for every 24/6 = 4 hours, and the selected tweet is two hours old in average. When c = 12 and c = 24, the delay for Time-Interval algorithm is about one hour and 30 minutes for the same reason. On the other hand, Every n-Tweets algorithm has the average delay longer than the Time-Interval algorithm for c = 6, but slightly shorter for c = 12 and c = 24. Unlike the Time-Interval algorithm, the Every n-Tweets algorithm does not have the upper bound of the delay (except for the end of the current interval T ). Suppose |D(u, T )| = 96 and c = 24. Then we retweet every time we receive 4 tweets. In this case, even if we have three tweets in the buffer, if the fourth tweet do not arrive for long time, no tweet is retweeted for a long time. This does not happen in Time-Interval algorithm as long as we have at least one tweet in the buffer. However, these experimental results show that the n-Tweets algorithm does not cause long delays even when c is small, and produce even shorter delay than Time-Interval algorithm when c is large. In order to determine how long delays can be acceptable for ordinary followers of the portals, we calculated time lags between the arrival and retweeting for all tweets retweeted by sec corrected c=6 c = 12 History-Based Stochastic Time-Interval Every n-Tweets c = 24 Fig. 2. Average delay of the algorithms with c = 6, 12, 24 TABLE IV D ELAY IN 7 R EAL P ORTAL ACCOUNTS ON T WITTER No. 1 2 3 4 5 6 7 average (sec) 8982 39966 13180 33069 10524 63665 16458 max (sec) 1121476 1491408 74051 611051 243247 578893 265742 min (sec) 22 26 258 88 33 117 98 median (sec) 1560 5950 4775 10747 2625 60075 3098 our 7 portal accounts. The result is summarized in Table IV. Compared with the average delays in these manually managed portal accounts, the delays in our automated system seems acceptable even with near-online algorithms. Based on our experimental results and the discussion above, our conclusion is that Every n-Tweets algorithm is the best choice for our purpose. Its advantages are as follows: 1) It shows the highest selection quality which is slightly better than that of Time-Interval algorithm. 2) It causes only slightly longer delays than Time-Interval algorithm even in the worst case, i.e., even when c is small. On the other hand, the two real-time algorithms did not work well. The reason why Stochastic Threshold algorithm did not work well could be the lack of enough information on the distribution of tweet scores. VII. C ONCLUSION In this paper, we proposed four tweet selection algorithms for retweet recommendation systems. Our experimental results with real Twitter data shows that Every n-Tweets algorithm is the best because it achieves the highest selection quality only with acceptable delays. Our algorithms are also applicable to other stream media, such as RSS. There are several directions for future research. Instead of specifying a rigid upper limit, we can give some penalty to retweets exceeding the limit. In such a setting, even when a portal has retweeted the given number of tweets, it can retweet additional tweets if some of the following tweets are really important. Another interesting direction is dynamically changing the limit depending on the time in a day. For example, we can set the limst to be a small value during the daytime when the users are at work, while we can set it to be a larger value durint the evening. R EFERENCES [1] “Twitter statistics,” http://www.statisticbrain.com/twitter-statistics/, Jan. 2014. [2] M. Rosoff, “Twitter has 100 million active users – and 40% are just watching,” http://www.businessinsider.com/ twitter-ceo-dick-costolo-2011-9, Sept. 2011. [3] R. Maestre and F. Tapia, “Information propagation in twitters network,” http://labs.paradigmatecnologico.com/2011/07/06/ information-propagation-in-twitters-network/, Jul. 2011. [4] J. Fritz, “The life cycle of a tweet and facebook post still worth it?” http://nonprofit.about.com/b/2011/08/31/ the-life-cycle-of-a-tweet-and-facebook-post-still-worth-it.htm, 2011. [5] J. Nielsen, “Writing for social media: Usability of corporate content distributed through facebook, twitter & linkedin,” http://www.nngroup. com/articles/writing-social-media-facebook-twitter/, Oct. 2009. [6] S. M. Kywe, E.-P. Lim, and F. Zhu, “A survey of recommender systems in Twitter,” in Proc. of SocInfo, 2012, pp. 420–433. [7] I. Uysal and W. B. Croft, “User oriented tweet ranking: A filtering approach to microblogs,” in Proc of CIKM, 2011, pp. 2261–2264. [8] P. Nasirifard and C. Hayes, “Tadvise: A Twitter assistant based on twitter lists,” in Prof. of SocInfo, 2011, pp. 153–160. [9] S. Wang, X. Zhou, Z. Wang, and M. Zhang, “Please spread: Recommending tweets for retweeting with implicit feedback,” in Proc. of DUBMMSM, 2012, pp. 19–22. [10] B. Suh, L. Hong, P. Pirolli, and E. H. Chi, “Want to be retweeted? large scale analytics on factors impacting retweet in Twitter network,” in Proc. of SocialCom/PASSAT. IEEE, 2010, pp. 177–184. [11] Z. Yang, J. Guo, K. Cai, J. Tang, J. Li, L. Zhang, and Z. Su, “Understanding retweeting behaviors in social networks,” in Proc. of CIKM. ACM, 2010, pp. 1633–1636. [12] S. A. Macskassy and M. Michelson, “Why do people retweet? Antihomophily wins the day!” in Proc. of ICWSM, 2011, pp. 209–216. [13] H.-K. Peng, J. Zhu, D. Piao, R. Yan, and Y. Zhang, “Retweet modeling using conditional random fields,” in Proc. of ICDMW, 2011, pp. 336– 343. [14] Z. Xu and Q. Yang, “Analyzing user retweet behavior on twitter,” in Proc. of ASONAM 2012. IEEE, 2012, pp. 46–50. [15] R. Hochreiter and C. Waldhauser, “A stochastic simulation of the decision to retweet,” in Proc. of ADT, 2013, pp. 221–229. [16] H. He, Z. Yu, B. Guo, X. Lu, and J. Tian, “Tree-based mining for discovering patterns of reposting behavior in microblog,” in Proc. of ADMA (1), 2013, pp. 372–384. [17] X.-T. Vu, P. Morizet-Mahoudeaux, and M.-H. Abel, “Empowering collaborative intelligence by the use of user-centered social network aggregation,” in Proc. of Web Intelligence, 2013, pp. 425–430. [18] R. D. Kleinberg, “A multiple-choice secretary algorithm with applications to online auctions,” in Proc. of SODA, 2005, pp. 630–631. [19] G. S. Lueker, “Average-case analysis of off-line and on-line knapsack problems,” in Proc. of SODA, 1995, pp. 179–188. [20] A. Marchetti-Spaccamela and C. Vercellis, “Stochastic on-line knapsack problems,” Mathematical Programming, vol. 68, no. 1-3, pp. 73–104, 1995. [21] Y. Song, H. Wang, Z. Wang, H. Li, and W. Chen, “Short text conceptualization using a probabilistic knowledgebase,” in Proc. of IJCAI Vol.3, 2011, pp. 2330–2336. [22] D. Kuseta, “The average tweet length is 28 characters long, and other interesting facts,” http://www.smk.net.au/article/ the-average-tweet-length-is-28-characters-long-and-other-interesting-facts, Nov. 2012. [23] H. Kellerer, et al. Knapsack Problems. Springer, 2004. [24] C. M. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.

Download PDF

advertisement