This project is forked from karstenkreis/readlikeyoutweet

Recommending News Articles to Twitter Users based on their Tweets

Python 38.91% CSS 50.31% JavaScript 1.24% HTML 9.54%

Read Like You Tweet

A New York Times Article Recommendation System Based On Your Twitter Timeline

Author: Karsten Kreis

Overview (a very detailed description of the recommender can be found in underthehood.ipynb)

This project started as my final project for General Assembly's Data Science class in New York City in the summer of 2015. The idea is to recommend New York Times articles to Twitter users based on their tweets. It works as follows:

  • I downloaded over 100,000 article snippets from the New York Times Article Search API and labeled each one with its section
  • I vectorized the text into features with a term frequency-inverse document frequency (TF-IDF) vectorizer
  • I trained a multiclass Logistic Regression classifier to predict an article's section from these features
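
To illustrate the vectorization step, here is a minimal sketch of a plain TF-IDF weighting in pure Python. It is not the project's code: the function name `tfidf_vectors`, the whitespace tokenization, and the example snippets are assumptions made for illustration, and library vectorizers typically add smoothing and normalization on top of this basic scheme.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Return one {term: weight} dict per document using a plain TF-IDF scheme."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Weight = relative term frequency * log(inverse document frequency)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors

# Hypothetical article snippets, not taken from the real dataset
snippets = [
    "senate passes budget bill",
    "team wins championship game",
    "senate debates new bill",
]
vectors = tfidf_vectors(snippets)
```

Terms that occur in only one snippet (such as "passes") receive a higher weight than terms shared across snippets (such as "senate"), which is what makes them useful for telling sections apart.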

The steps above happened "offline". Just as the words in an article indicate the section it belongs to, the same words in tweets are likely to indicate that the Twitter user is interested in news from that section. Therefore, the trained model can be used to predict a Twitter user's interests.

The program/website does the following:

  • A Twitter user provides her or his Twitter handle
  • The user's 100 latest tweets are downloaded through the Twitter API
  • These tweets are processed and vectorized in the same way as the article data and fed into the Logistic Regression model
  • This should, hopefully, yield the section the user most likely wants to read news from
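
The tweet processing step could look roughly like the sketch below. The README does not say how tweets are cleaned, so the rules used here (dropping URLs and @mentions, lowercasing, removing stopwords), the function name `tweets_to_tokens`, and the example tweets are all assumptions for illustration.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
TOKEN_RE = re.compile(r"[a-z']+")

def tweets_to_tokens(tweets, stopwords=frozenset()):
    """Merge tweets into one list of lowercase word tokens."""
    tokens = []
    for tweet in tweets:
        # Remove URLs and @mentions, which carry little topical information
        text = URL_RE.sub(" ", tweet)
        text = MENTION_RE.sub(" ", text)
        tokens.extend(
            word for word in TOKEN_RE.findall(text.lower())
            if word not in stopwords
        )
    return tokens

# Hypothetical tweets and stopword list
tweets = [
    "Great win for the team tonight! https://example.com/abc",
    "@friend did you watch the game?",
]
tokens = tweets_to_tokens(tweets, stopwords={"the", "for", "did", "you"})
```

The resulting token list would then be passed through the same vectorizer that was fitted on the article data, so that tweets and articles share one feature space.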

The final step:

  • Connect to the New York Times Top Stories API
  • Fetch the top story articles from the section predicted by the classifier. This usually yields 30 articles from this section
  • Calculate the Jaccard distance between each of these articles and the user's tweets
  • Recommend the closest article, i.e. the one with the smallest distance, to the Twitter user
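
The Jaccard distance between two sets of words is one minus the size of their intersection divided by the size of their union. The ranking step can be sketched as follows, assuming tweets and articles have already been reduced to word tokens; the function names and the example data are hypothetical and not taken from the project.

```python
def jaccard_distance(a, b):
    """Return 1 minus (size of intersection / size of union) of two token collections."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0  # two empty sets are treated as identical
    return 1.0 - len(set_a & set_b) / len(union)

def closest_article(tweet_tokens, articles):
    """Return the (title, tokens) pair with the smallest Jaccard distance."""
    return min(articles, key=lambda article: jaccard_distance(tweet_tokens, article[1]))

# Hypothetical tokens from a user's tweets and from two candidate articles
tweet_tokens = ["team", "win", "game", "tonight"]
articles = [
    ("Budget Talks Stall", ["senate", "budget", "bill", "talks"]),
    ("Late Goal Decides Game", ["team", "game", "goal", "win"]),
]
best_title, _ = closest_article(tweet_tokens, articles)
```

Here the second article shares three of five distinct words with the tweets (distance 0.4), while the first shares none (distance 1.0), so the second one is recommended.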

Possible further modifications

There are many possible improvements and extensions:

  • Try to fit a stronger model, possibly using other classifiers
  • Use dimensionality reduction or clustering techniques to gain further insights and/or reduce features
  • Predict several probable labels and do not only recommend from one section but from several probable ones
  • Scrape whole articles with web scraping tools like Beautiful Soup instead of using only headlines, snippets and keywords. This might help both when training the model and when calculating the Jaccard distances
  • Include newspapers other than the New York Times, both for model training and for recommendation (for example the Guardian, which also has a great API framework)
  • Check where the user comes from (UK, US, Australia) and recommend from NYT/Guardian US, Guardian UK, or Guardian Australia accordingly
  • Extend the system beyond English-speaking Twitter users and English newspaper articles
  • Try to get even more user information, for example from Facebook, LinkedIn, etc., to make even better recommendations
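
As one example, recommending from several probable sections could be sketched as below. This assumes the classifier can return a probability per section; the function name `top_sections`, the threshold, and the probabilities are made up for illustration.

```python
def top_sections(section_probs, k=3, min_prob=0.1):
    """Return up to k section names whose probability is at least min_prob."""
    ranked = sorted(section_probs.items(), key=lambda item: item[1], reverse=True)
    return [name for name, prob in ranked[:k] if prob >= min_prob]

# Hypothetical per-section probabilities for one user
probs = {"sports": 0.55, "politics": 0.25, "arts": 0.15, "science": 0.05}
chosen = top_sections(probs, k=3, min_prob=0.2)
```

Articles could then be fetched from every section in `chosen` before ranking them by Jaccard distance.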

Files

Note that I did not upload the actual datasets, the pickled logistic regression model, the pickled TF-IDF vectorizer or the pickled stopwords (the website needs the stopwords to be pickled as well). However, with the code the data can be downloaded again and the models retrained.
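
Recreating a pickled object is a short round trip with Python's `pickle` module. The sketch below uses a placeholder stopword set and a hypothetical file name; the same pattern would apply to a fitted vectorizer or model.

```python
import os
import pickle
import tempfile

stopwords = {"the", "a", "and", "of"}  # placeholder, not the project's list

# Hypothetical file name inside a temporary directory
path = os.path.join(tempfile.mkdtemp(), "stopwords.pkl")

# Save the object to disk
with open(path, "wb") as handle:
    pickle.dump(stopwords, handle)

# Load it back, as the website would do at startup
with open(path, "rb") as handle:
    restored = pickle.load(handle)
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.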

Furthermore, note that the whole code naturally requires API keys for all involved APIs to work.
