The is-xhuang1994 from tapilab

Problem

The task is to develop a classification model that distinguishes bots from humans on Twitter based on available data. Bots are social media accounts (e.g., on Twitter) that are controlled by computer programs. Bots can post, retweet, and even reply automatically. They are usually created for a purpose such as hyping, advertising or promotion, or spreading news and rumors.

Related work

Several papers on this topic use the same method: find and collect numeric features for each user, then feed the data to a classification algorithm. The highest classification accuracy reported is around 99%.

Related papers:
Who is Tweeting on Twitter: Human, Bot, or Cyborg?
Detecting and Analyzing Automated Activity on Twitter
Detecting Spammers on Twitter
Uncovering Social Spammers: Social Honeypots + Machine Learning
A Long-Term Study of Content Polluters on Twitter

Data source

I started with the social honeypot dataset [1]. The classification model was developed based on the papers I read and tested with data from the honeypot dataset. My initial model uses a Random Forest Classifier with default parameters from scikit-learn. The cross-validation result is:

Figure 1: Cross validation results with old data of old users

As Dr. Culotta suggested, the performance is a little "too good". The paper also mentions that some thresholds were used when collecting users, which could make the result not generalizable. To reveal the problem I plotted graphs of the user data. Here are some funny graphs I got:

Figure 2: Funny graphs based on old data of old users

As you can see, there is a clear boundary in the ratio of number of followers to number of followings (about 0.9), which doesn't make sense unless it is a threshold used when collecting users and their data. So the data in the social honeypot dataset is highly biased.
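The boundary can also be checked numerically rather than by eye. The sketch below is only an illustration: the `followers` and `friends` field names, the sample users, and the choice to skip users who follow nobody are assumptions, not details from the dataset.

```python
def follower_friend_ratio(followers, friends):
    """Followers divided by friends; None when the user follows nobody."""
    if friends == 0:
        return None
    return followers / friends


def share_above(users, threshold=0.9):
    """Fraction of users whose followers/friends ratio is at or above the threshold."""
    ratios = [follower_friend_ratio(u["followers"], u["friends"]) for u in users]
    ratios = [r for r in ratios if r is not None]  # drop undefined ratios
    if not ratios:
        return 0.0
    return sum(1 for r in ratios if r >= threshold) / len(ratios)


# Hypothetical users, for illustration only.
users = [
    {"followers": 95, "friends": 100},   # ratio 0.95
    {"followers": 300, "friends": 150},  # ratio 2.0
    {"followers": 10, "friends": 200},   # ratio 0.05
    {"followers": 5, "friends": 0},      # undefined ratio, skipped
]

print(share_above(users))
```

A share close to 1.0 on one side of a round threshold would be consistent with a collection-time filter rather than natural user behavior.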

To confirm this I collected the most recent data for the same users and tested my model on it. The cross-validation result and the graph both look a little more "normal" this time:

Figure 3: Cross validation results with new data of old users

Figure 4: Graphs based on new_data of old users

The performance is a little worse, which is expected. In the graph some of the users now appear below the boundary, but it still looks biased because most of the users are still above the boundary.

After I realized the problem, I "randomly" collected 1100 users on Twitter through the Streaming API: the first 1100 users who posted anything after I started the collecting process. It's not completely random, but we might still draw some generalizable conclusions from it. I then spent a few weeks manually labeling 523 of the users, of which 54 are bots and 469 are humans. I also plotted the same graph as above, and now it looks unbiased:

Figure 5: Graph based on new_users

Features

For each user, 45 features were extracted from the available data. The 22 most important features are listed below (ordered by importance):

 Number of replies over number of posts

 3-gram Jaccard similarity between each two tweets

 Number of retweets over number of posts

 Average number of times each tweet is retweeted

 Number of tweets posted on Sundays

 Number of tweets posted on Saturdays

 1-gram Jaccard similarity between each two tweets

 Number of statuses over number of followers

 Number of friends over number of followers

 Number of unique mentions over number of posts

 Number of hashtags over number of posts

 Number of friends

 Number of statuses over number of friends

 Number of statuses

 2-gram Jaccard similarity between each two tweets

 Number of tweets posted on Fridays

 Length of user description

 Number of unique urls over number of tweets

 Number of mentions over number of tweets

 Number of tweets posted on Saturdays

 Length of username

 Ratio of unique hashtags over tweets
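
As an illustration of the n-gram Jaccard similarity features in the list above, here is a minimal sketch. It assumes word-level n-grams, lowercase whitespace tokenization, and averaging the similarity over all pairs of a user's tweets; the original feature extraction may tokenize or aggregate differently, and the function names are hypothetical.

```python
from itertools import combinations


def word_ngrams(text, n):
    """Set of word-level n-grams in a tweet (lowercased, whitespace-tokenized)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b|; defined here as 0.0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def mean_pairwise_jaccard(tweets, n):
    """Average n-gram Jaccard similarity over every pair of tweets."""
    grams = [word_ngrams(t, n) for t in tweets]
    pairs = list(combinations(grams, 2))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical tweets, for illustration only.
tweets = [
    "buy cheap watches now",
    "buy cheap watches today",
    "good morning everyone",
]

for n in (1, 2, 3):
    print(n, mean_pairwise_jaccard(tweets, n))
```

Accounts that post near-duplicate text score high on these features, which is one plausible reason they rank among the most important ones.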

Evaluation

Performance Measuring

I used precision and recall on the bot class to measure performance because the classes are imbalanced. I found that I can't guarantee both precision and recall at the same time; there is always a trade-off between them. But by defining a false-negative cost for bots, I can guarantee either precision or recall with the following decision rules (p is the predicted probability of being a bot):

p * cost > 1 - p => Bot
p * cost < 1 - p => Human

By varying the cost of bots, we can have either high precision or high recall. Someone who wants to use my model to find and analyze some bots would want high precision, so that the users returned are very likely to be bots. Someone who wants to find as many bots as possible in a set of users would want high recall. The precision would not be very high and each returned user would still need to be checked, but the workload would be reduced a lot. Since precision and recall change with bots_cost, I plot precision and recall against bots_cost to show the model performance.
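
The decision rule can be written as a small function. This is a sketch, not the project's code: the function names are hypothetical, and treating the tie case (p * cost equal to 1 - p), which the rule leaves undefined, as human is an assumption. For a positive cost, rearranging p * cost > 1 - p gives p > 1 / (1 + cost), so raising the cost simply lowers the probability threshold for calling a user a bot.

```python
def classify(p_bot, bots_cost=1.0):
    """Label a user 'bot' when p * cost > 1 - p, otherwise 'human'.

    The tie case is labeled 'human' here, which is an assumption.
    """
    return "bot" if p_bot * bots_cost > 1.0 - p_bot else "human"


def equivalent_threshold(bots_cost):
    """Probability threshold implied by the rule: p > 1 / (1 + cost)."""
    return 1.0 / (1.0 + bots_cost)


# The same predicted probability flips label as the cost changes.
print(classify(0.3, bots_cost=1.0), equivalent_threshold(1.0))
print(classify(0.3, bots_cost=3.0), equivalent_threshold(3.0))
```

With bots_cost = 1 the rule reduces to the usual 0.5 threshold.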

Classifiers

Result by Logistic Regression with penalty = L2, C = 1, and bots_cost = 1:

Precision: 0.7540 Recall: 0.5080 AUC of ROC: 0.7454

Result by Random Forest Classifier with default parameters and bots_cost = 1:

Precision: 0.7587 Recall: 0.5482 AUC of ROC: 0.7654

It turns out that Random Forest Classifier has better performance than Logistic Regression. The change of precision and recall with max_depth is shown in Figure 6. The best value for max_depth, according to the figure, is 7.

Figure 6: Precision, recall, and AUC of ROC vs. max_depth

The change of precision and recall with min_samples_leaf is shown in Figure 7. The best value for min_samples_leaf, according to the figure, is 6.

Figure 7: Precision, recall, and AUC of ROC vs. min_samples_leaf

So the classification model I selected is Random Forest Classifier with max_depth = 7 and min_samples_leaf = 6.

Performance

I used 4-fold cross-validation with different values of bots_cost to show the performance of my model. The result for my selected classification model is shown in Figure 8. By varying bots_cost, either precision or recall can be brought very close to, or even equal to, 1. As bots_cost increases, precision on bots decreases and recall on bots increases dramatically. I think this shows that my classification model is effective.

Figure 8: Precision, recall, and AUC of ROC vs. bots_cost using selected classification model.
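
The trade-off plotted in Figure 8 can be reproduced in miniature. The sketch below applies the rule p * cost > 1 - p at several costs and computes precision and recall on the bot class. The labels and probabilities are made up for illustration; in the actual experiment they would come from the cross-validated classifier. Returning None when a metric is undefined is also an assumption.

```python
def precision_recall(y_true, p_bot, bots_cost):
    """Precision and recall on the bot class (label 1) under p * cost > 1 - p."""
    predicted_bot = [p * bots_cost > 1.0 - p for p in p_bot]
    tp = sum(1 for y, pred in zip(y_true, predicted_bot) if y == 1 and pred)
    fp = sum(1 for y, pred in zip(y_true, predicted_bot) if y == 0 and pred)
    fn = sum(1 for y, pred in zip(y_true, predicted_bot) if y == 1 and not pred)
    precision = tp / (tp + fp) if (tp + fp) else None  # undefined with no predicted bots
    recall = tp / (tp + fn) if (tp + fn) else None     # undefined with no true bots
    return precision, recall


# Hypothetical labels (1 = bot) and predicted bot probabilities.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
p_bot = [0.9, 0.6, 0.3, 0.4, 0.15, 0.1, 0.05, 0.7]

sweep = {cost: precision_recall(y_true, p_bot, cost) for cost in (0.2, 1.0, 4.0)}
for cost, (precision, recall) in sweep.items():
    print(cost, precision, recall)
```

On this toy data, the lowest cost gives perfect precision with low recall, and the highest cost gives perfect recall with lower precision, mirroring the direction of the trend described above.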

Something Funny

We know that bots and humans are distinguished by their behaviors (e.g., autonomous behavior indicates bots). But when building a classification model, there may not always be a good feature to represent a behavior. If a user is a phony fan, it tends to have many more followers than friends. This behavior can be represented as the number of followers over the number of friends. But if a user keeps sharing articles from the same website and also posts words that are duplicated (exactly the same words can be found elsewhere), it might be considered a bot, and such behavior cannot easily be represented by a feature. That sometimes makes it hard to distinguish bots from humans.

Here's something interesting: some users have very normal stats and normal actions, like human users, but I labeled them as bots for some other reason that can't be well represented by any feature. For example, some users use language processing techniques:

Figures 9-10: Examples of users using language processing technique

In the first two figures the user is clearly a sharer bot that uses language processing to replace words with alternatives. In the second figure the user adds pound signs before some words to make them hashtags, which doesn't make sense unless it's autonomous behavior.

Conclusion

From the results I conclude that I am heading in the right direction on this problem, though my results are not very good. Some bots are very hard for my model to distinguish from humans, and there are some users I still have trouble labeling. Yet my model does work for some special cases: it can achieve either high precision or high recall.

There is still a lot that can be done to improve performance, such as finding more good features and perhaps analyzing the relationship graph of the users. I can also expand my data: I only have 523 instances for training because I couldn't find a good dataset for this topic.

References

[1] K. Lee, B. Eoff, and J. Caverlee. Seven Months with the Devils: A Long-Term Study of Content Polluters on Twitter. In Proceedings of the 5th International AAAI Conference on Weblogs and Social Media (ICWSM), Barcelona, July 2011.

docs/RelatedWork.md

Use Google scholar to find 10 or so papers related to:
"fraud in social media" "bot detection" "fraudulent users"

Write a brief description
- What is the problem?
- What data do they collect?
- What method do they use?

Some conferences:

ICWSM
KDD
AAAI
WWW

Description

Social media bots are now widespread. They are designed to act like humans: automatically tweeting, retweeting, and responding to others' posts. In addition to bots, cyborgs have emerged, meaning either bot-assisted humans or human-assisted bots. They are usually maintained by companies or websites as a media channel and for customer service. Most bots are designed to provide particular services, but they can be very harmful. They can mislead, exploit, and manipulate social media discourse with rumors, spam, malware, misinformation, slander, or even just noise.

What is the problem?

Twitter bots have become increasingly intelligent and complicated, making their detection more difficult. For example, social bots can search the Web for information and media to fill their profiles, and post collected material at predetermined times, emulating the human temporal signature of content production and consumption, including circadian patterns of daily activity and temporal spikes of information generation. They can even engage in more complex types of interactions, such as entertaining conversations with other people, commenting on their posts, and answering their questions. To acquire visibility, they can infiltrate popular discussions, generating topically appropriate (and even potentially interesting) content by identifying relevant keywords and searching online for information fitting that conversation. They can also automatically produce responses through natural language algorithms.

What data do they collect?

Relationship graph of users.
Particular features of users, such as number of tweets, number of friends & followers, temporal features of tweets, tweeting devices used, etc.

What method do they use?

Examining the structure of a social graph. SybilRank, for example, assumes that sybil accounts have only a small number of links to legitimate users and instead connect mostly to other sybils, since they need a large number of social ties to appear trustworthy. Detection methods can be used to reveal such tightly knit local communities. This kind of method is not very powerful, since attackers can mimic the community structure of legitimate accounts. As the graph grows large, things become much more complicated.

Crowd-sourcing bot detection. Given the profiles of some users, it has been observed that a group of workers can distinguish bots from legitimate users with very high accuracy. This method is not cost-effective for a platform with a large pre-existing user base, like Twitter or Facebook. It also raises some privacy issues.

Feature-based bot detection. This approach implements a detection algorithm relying on several highly predictive features that capture a variety of suspicious behaviors and separate social bots from humans.
