Copied from old repo:
The trick to sites like Reddit and Hacker News is that they have a good algorithm for ranking content:
http://amix.dk/blog/post/19574
http://amix.dk/blog/post/19588
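For context, the ranking described in those posts boils down to a gravity formula: votes divided by an age term raised to a power. A rough Python sketch of the commonly quoted Hacker News version (the constants here are the widely cited ones, not anything official):

```python
def hn_rank(points, age_hours, gravity=1.8):
    """Hacker News-style score: votes decayed by item age.

    The -1 discounts the submitter's own vote; the +2 offset keeps
    brand-new items from dividing by a tiny number.
    """
    return (points - 1) / (age_hours + 2) ** gravity
```

With a fixed number of points, the score falls steadily as the item ages, which is exactly the time-based front-page decay described below.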
I would like to develop an algorithm for NeoCities site ranking, in particular for the browse page: http://neocities.org/browse.
I'm not sure what the best approach is here, and to be honest, I really suck at algorithms. We're in an unusual situation, and solving it will require a fairly unusual algorithm. There's also a lot of subjectivity here. So I'm posting this to see if anyone in the community can come up with a default browse algorithm that works well for NeoCities site browsing/highlighting.
Currently, we just show the most recently updated sites. That's not ideal: it punishes sites that don't update all the time (it rewards quantity over quality), and a few people are already gaming it with a script that regularly re-uploads the same HTML to their site. I expect the gaming problem to get progressively worse as the site grows.
It's a bit different from Reddit and Hacker News, in that posts there are time-based and slowly disappear from the front page as they age. We need an algorithm that doesn't necessarily punish sites for being old (on the contrary, we want to better promote sites that aren't just "flashes in the pan"), that surfaces sites trending in popularity, but that is dynamic enough not to give them permanent placement at the top, so other sites get a chance to be seen too.
Here is the information we can currently derive a ranking from:
- When site was created
- When site was updated
- How big the index.html is
- How much disk space the site is using
- How many hits the site has gotten
- How many files the site has
- Whether the site provided an email on creation or not
- When files were created
- When files were updated
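As a starting point, a toy score combining a few of the signals above might look like the following. The field names and weights are invented purely for illustration, not something from this issue:

```python
import time

def browse_score(site, now=None):
    """Toy browse score from currently-available signals.

    `site` is assumed to be a dict with epoch-second timestamps
    `created_at` and `updated_at` plus a `hits` counter.
    All weights are made up for illustration.
    """
    now = now if now is not None else time.time()
    age_days = (now - site["created_at"]) / 86400
    stale_days = (now - site["updated_at"]) / 86400
    # Reward traffic, discount staleness, lightly reward longevity
    # (capped so ancient sites don't win on age alone).
    return site["hits"] / (1 + stale_days) + 0.1 * min(age_days, 365)
```

Note this still leans on update recency, so it would need rate-limiting or per-day caps to blunt the re-upload gaming mentioned above.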
Longer term, more information will be available. We are currently working to build out more social functionality on the site, which will allow users to follow/like sites, comment on changes, and make it much easier to share things. It will add this information to the pool:
- Number of follows a site has
- The date those follows occurred at
- How many comments a site change has received
- How many likes a site update has gotten, and when those likes occurred
- How many times a site has been shared
- How many tags a site has
- How popular those tags are
- The last time a site user logged in
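Once follow data exists, one simple "trending" signal would be to decay each follow with a half-life, so old follows fade gradually without ever punishing a site just for being old. A sketch, where the 7-day half-life is an arbitrary choice for illustration:

```python
import time

HALF_LIFE_DAYS = 7.0  # arbitrary; tune against real data

def trending_score(follow_timestamps, now=None):
    """Sum of follows, each weighted by 0.5 per half-life elapsed.

    A follow from right now contributes 1.0; one from a half-life
    ago contributes 0.5; older follows fade toward zero.
    """
    now = now if now is not None else time.time()
    return sum(
        0.5 ** ((now - t) / 86400 / HALF_LIFE_DAYS)
        for t in follow_timestamps
    )
```

A site gaining follows steadily keeps a stable score, while a one-time burst decays away, which matches the "no permanent placement" goal above.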
Ideas/formulas welcome. CCing @amix on this because this is clearly something he thinks about a lot, and could potentially have some advice/pointers/formulas/book-recommendations to contribute!
To make things interesting, I'm offering a 0.5 BTC bounty (or the same amount in USD via Paypal or Venmo) to any user who contributes an algorithm that becomes our default algorithm (the "special sauce" sorting). You will also receive credit for developing the algorithm somewhere, perhaps even at the bottom of the browse page.
And of course, the algorithm implementation will be open source because the site is open source. Feel free to send pull requests if you'd prefer to do it that way, though it probably makes more sense to start with a formula. Thanks!
RESPONSES:
@noelwelsh:
This is going to be a bit of a drive-by comment because I don't have a great deal of time. Apologies in advance.
Firstly, I think you need to define the problem more precisely. I would define it thus: "the algorithm should return a list of sites ordered according to the likelihood of the user finding them engaging." Very quickly you need to decide if you are personalizing this list per-user or not. I'm assuming you're not.
With this definition we can go places because:
- There is a huge literature on measuring engagement; and
- There is a huge literature on generating rankings.
For the former we can start by looking here: http://www.dcs.gla.ac.uk/~mounia/ (There's lots more work in the field. I just happen to have some familiarity with her work.) For the latter searching for "submodular diversity ranking" will return a lot of papers with complicated algorithms. Yay!
The number one problem I see is that you don't have sufficient information to get a useful measurement of engagement. If you don't have useful feedback you can't learn whether what you're suggesting is any good. At the minimum you want to measure clickthrough. Better would be time-on-page (possibly normalised by page length). Solve this problem, even poorly, and we can then look at using this to generate rankings.
As for generating rankings, while the submodular optimisation approach has some very nice properties (e.g. it maintains diversity in the suggestions), there are simpler methods to start with. Let's say we have some measure of engagement. We can put a confidence interval on this measure using standard maths. Then just display the sites with the highest upper bound on the confidence interval. You should decay the measure over time, widening the confidence interval, to allow for changing interests.
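A sketch of that upper-bound idea, using the Wilson score interval on a clickthrough measure. The decay step isn't shown here; it could be layered on by discounting old clicks and impressions before calling this. Everything below is illustrative, not a drop-in implementation:

```python
import math

def wilson_upper(clicks, impressions, z=1.96):
    """Upper bound of the Wilson score interval for clickthrough rate.

    z=1.96 corresponds to 95% confidence. Sites with little data get
    a wide interval (high upper bound), so they still get exposure;
    as impressions accumulate the bound converges on the true rate.
    """
    if impressions == 0:
        return 1.0  # no data yet: maximal uncertainty, give it a chance
    p = clicks / impressions
    z2 = z * z
    denom = 1 + z2 / impressions
    centre = p + z2 / (2 * impressions)
    margin = z * math.sqrt((p * (1 - p) + z2 / (4 * impressions)) / impressions)
    return (centre + margin) / denom
```

Ranking by this bound is effectively an optimism-in-the-face-of-uncertainty (UCB-style) policy: unproven sites are boosted until the data says otherwise.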
Hope that's useful.
@robsheldon:
I left a comment on HN (https://news.ycombinator.com/item?id=7538491) with a straightforward, naive approach that should work OK. Includes pseudocode.
@amix:
I don't have that much time to devise an algorithm, but I would recommend reading "Programming Collective Intelligence": http://shop.oreilly.com/product/9780596529321.do It has a ton of great examples that you can get inspired by.
I would also recommend keeping the algorithm simple and testing it on real data. If you look at both Reddit's and HN's algorithm they are really, really simple, but quite effective.
These are my 2 cents; sorry I couldn't be of more help.