Giter Site home page Giter Site logo

social-protocols / news Goto Github PK

View Code? Open in Web Editor NEW
43.0 3.0 4.0 1.22 MB

Quality News - Towards a fairer ranking formula for Hacker News

Home Page: https://news.social-protocols.org

License: Apache License 2.0

Go 99.12% Nix 0.27% Shell 0.61%
hacker-news news-aggregator hacker-news-reader

news's Introduction

Towards a fairer ranking formula for Hacker News

GitHub

Quality News is a Hacker News client that provides additional data and insights on submissions, notably, the upvoteRate metric. The metric is the result of our collection and analysis of the Hacker News voting data set. We propose that this metric could be used to improve the Hacker News ranking algorithm.

Quality News uses live minute-by-minute rank and upvote data collected from Hacker News. It looks and behaves very similar to the original Hacker News site except it shows upvoteRate and other metrics next to each story. It also provides charts with the history of each story's rank, upvotes, and upvote rate. It is a lightweight, server-side rendered page written in go and hosted on fly.io.

We hope for interest and support from the Hacker News community to encourage official ranking experiments on the Hacker News frontpage. We're happy to help where we can in this regard.

This is a collective intelligence experiment by Social Protocols. Follow us on Mastodon or Twitter, or send a mail.

Introduction

The Problem

The success of a story on HN is partly a matter of timing and luck. A few early upvotes can catapult a new story to the front page where it can get caught in a feedback loop of even more upvotes.

graph LR
    R(Higher Rank)
    U(More Upvotes)
    R --> U
    U --> R

As a result, the HN ranking algorithm is not consistent: the same story submitted multiple times can receive vastly different numbers of upvotes every time. For example, this story was submitted 3 times over the course of two days, receiving 6 upvotes in total for the first two submissions, and 437 upvotes for the third.

Nor are upvote counts necessarily comparable. The best submissions don't always get caught in the positive feedback loop. So the stories that make the front page may not reflect the aggregate intent of Hacker News community-members as revealed by their upvote behavior. We discuss this problem further in our article Improving the Hacker News Ranking Algorithm.

The Solution

Quality News aims to solve this problem with a new metric: upvoteRate. upvoteRate quantifies how much more or less likely users are to upvote a story compared to the average story regardless of:

  • the time/day of week a story was submitted
  • overall amount of traffic to the site
  • whether the story gets caught in a positive rank-upvote feedback loop

We believe that if the HN Ranking Formula were based on upvoteRate instead of upvotes, the stories on the front page might better reflect the intent of the community.

The Current Ranking Formula

This is the current Hacker News ranking formula:

 rankingScore = pow(upvotes, 0.8) / pow(ageHours + 2, 1.8)

The ranking on the front page is created by evaluating the formula for each story and arranging the stories according to their rankingScore. Note that this formula does not take the rank or timing of individual upvotes into account. So a story that receives 100 upvotes at rank 1 is valued the same as one that receives 100 upvotes at rank 30, even though a story on rank 1 is seen by more users than a story on rank 30. And upvotes received during peak hours are treated the same as upvotes received during period of low traffic. This makes upvotes an unreliable measure of the community interest in a story.

To solve this problem, the new upvoteRate metric should be adjusted for rank and timing, effectively giving upvotes received at high ranks and peak times less weight, eliminating the positive feedback loop.

Upvote Share by Rank

We start by looking at historical upvote data on Hacker News for each rank and page type: top (front page), new, best, ask, and show. We obtained this data by crawling the Hacker News API every minute for several months, and recording each story's rank and score (upvote count). The change in score tells us approximately how many upvotes occured at that rank during that time interval.

We then calculated the share of overall site-wide upvotes that occur at each rank. For example, the first story on the top page receives on average about 10.2% of all upvotes (about 1.169 upvotes per minute), whereas the 40th story on the new page receives about 0.05% (about 0.0055 upvotes per minute). Upvote shares for the top page are summarized in the chart below.

topRank avgUpvotes avgUpvoteShare
1 1.169 10.2%
2 0.698 6.1%
3 0.538 4.7%
... ...
10 0.274 2.4%
... ...
40 0.043 0.4%
... ...
80 0.013 0.1%
TOTAL 11.493 100%

Expected Upvotes Over Time

If we multiply the average upvote share for a rank by the total site-wide upvotes during some specific time interval, we get the number of expected upvotes for that rank and time interval. Or to be more precise, we get the number of upvotes we would expect the average story to receive at that rank during that time interval.

expectedUpvotes[rank, timeInterval]
    = avgUpvoteShare[rank] * sidewideUpvotes[timeInterval]

For example, during the minute starting 2023-02-07 11:08:00Z, there were a total of 9 upvotes on Hacker News. Rank 1 is expected to receive 10.2% of sitewide upvotes. So, regardless of how many upvotes the story at rank 1 actually received during that minute, the expected upvotes for a story at rank 1 during that minute is:

expectedUpvotes[1, 2023-02-07 11:08]
    = avgUpvoteShare[1] * sidewideUpvotes[2023-02-07 11:08]
    = .102 * 9
    = .918

Given a history of the story's rank over time, we can compute its total expected upvotes:

totalExpectedUpvotes for a story =
  sum(for each timeInterval in the history of that story) expectedUpvotes[rank of story at that timeInterval, timeInterval]

For example, suppose the story at rank 1 also spent the following minute there, during which time interval there were an additional 5 site-wide upvotes. So far, totalExpectedUpvotes for the story is:

  expectedUpvotes[1, 2023-02-07 11:08] + expectedUpvotes[1, time 2023-02-07 11:09] + ...
  = .102 * 9 + .102 * 5
  = 1.428

Computing this sum over the story's entire history on Hacker News gives the story's totalExpectedUpvotes. Below is a sample chart showing the actual rank history for one story, and the corresponding values of totalExpectedUpvotes vs actual upvotes. Note that this particular story consistently received far more upvotes than expected. You can see these charts for any story by clicking on the ×upvoteRate next to the story on Quality News.

Chart of Rank, Upvotes and Expected Upvotes for a Sample Story

sample stories rank history chart

sample expected upvotes chart


History of rank, upvotes, and expected upvotes for a sample story

The Upvote Rate

The above story consistently received more upvotes than expected. Other stories consistently receive less.

The factor of how many more or fewer upvotes a story receives than expected is called its upvoteRate. During each time interval, each story should receive, on average, the expectedUpvotes for the rank it was shown at times the story's upvoteRate:

upvotes[timeInterval]
    ≈ upvoteRate * expectedUpvotes[rank[timeInterval], timeInterval]
Chart of Upvote Rate for a Sample Story

sample upvote rate chart


History of the estimated upvote rate of a sample story

The relationship upvotes ≈ upvoteRate * expectedUpvotes holds even in the aggregate, independently of the ranks at which upvotes actually occurred. This can be seen by taking the sum of a story's upvotes across all time intervals:

totalUpvotes = sum{for each timeInterval} upvotes[timeInterval]
             = sum{for each timeInterval} upvoteRate * expectedUpvotes[rank[timeInterval], timeInterval]
             ≈ upvoteRate * sum{for each timeInterval} expectedUpvotes[rank[timeInterval], timeInterval]
             ≈ upvoteRate * totalExpectedUpvotes

This means we can estimate upvoteRate simply by dividing the story's total upvotes by its total expected upvotes:

upvoteRate ≈ totalUpvotes / totalExpectedUpvotes

We call this estimate the observed upvote rate.

We can better estimate the true upvote rate by adjusting the observed upvote rate to account for small sample sizes, and a phenomenon we call fatigue. See more details under Bayesian Averaging and Fatigue below.

A Proposed New Formula

The following is a proposed alternative Hacker News ranking formula based on estimatedUpvoteRate.

newRankingScore
    = pow(age * estimatedUpvoteRate, 0.8) / pow(age + 2, 1.8)

This formula simply substitutes age * estimatedUpvoteRate for upvotes in the original Hacker News ranking formula. In general, upvotes is roughly proportional to age * estimatedUpvoteRate, and the rankings are not effected by multiplying upvotes by a constant factor, so this formula approximates the ranking a story should have given its upvote rate and age.

Unfortunately, the actual Hacker News ranking score is further adjusted based on various factors including flags, moderator penalties and bonuses, domain, and number of comments. Although we know or can guess about some of these factors, data on flags and moderator actions are not publicly available. This means we can't actually reproduce the Hacker News ranking score, and therefore cannot modify it to show what the Hacker News front page would look like using our new formula.

Although we have made some attempts to automatically infer penalties and bonuses based on differences between a story's actual rank and raw rank, the results have been questionable so far.

As a logical next step, we suggest that the Hacker News team experiments with the new formula on an alternative front page. This will allow the community to evaluate whether the new ranking formula is an improvement over the current one.

Development

The application is a single Go process that crawls the Hacker News API every minute. For each story, it records the current rank and page (top, new, best, etc.), and how many upvotes it has received, computes the expected upvotes share for that rank and updates the accumulated expected upvotes for that story. The data is stored in a Sqlite database.

The frontpage generator queries the database and calculates the Bayesian average upvote rate in the SQL query. It then uses the Go templating library to generate HTML that mimics the original HN site. The frontpage is regenerated every minute and served directly from memory.

Running it locally

Make sure, you have:

  • go 1.19+
  • direnv - to set environment variables automatically
  • entr - to automatically rerun server when files change
  • sqlite3

Run:

source .envrc # if you don't have direnv installed

go get

Then:

go run *.go

Or, to automatically watch for source file changes:

./watch.sh

Using NIX

There is also a shell.nix available, which provides all required dependencies.

Install nix on your system, enter the news directory, and run:

nix-channel --update
nix-shell
./watch.sh

Contributions

All contributions are welcome! Please open issues and PRs.

Appendix: Refinements to Expected Upvotes Estimate

Bayesian Averaging

If we don't have a lot of data for a story, the observed upvote rate may not be a very accurate estimate of the true upvote rate.

A more sophisticated approach uses Bayesian inference: given our prior knowledge about the distribution of upvote rates, plus the evidence we have observed about this particular story, what does Bayes' rule tell us is the most probable true upvote rate?

Since there are infinitely many possible true upvote rates, we can't use a trivial application of Bayes rule. But we can estimate the most likely true upvote rate using a technique called Bayesian Averaging. Here is a good explanation of this technique from Evan Miller.

The Bayesian Averaging formula in our case is:

estimatedUpvoteRate ≈ (totalUpvotes + strengthOfPrior) / (totalExpectedUpvotes + strengthOfPrior)

Where strengthOfPrior is a constant we have estimated to be about 2.3 using an MCMC simulation.

Fatigue

In general, upvote rates decrease as a story receives more attention. The more attention a story has received, the more likely it is that users have already seen it. So if a story spends a lot of time on the home page the upvote rate will eventually start to drop noticeably.

But we'd like a true upvote rate estimate that measures the tendency of the story itself to attract upvotes, and not the amount of attention it has received on Hacker News. To do this we build a fatigue factor into the expected upvote model. In the formula below, the rate of growth of adjustedExpectedUpvotes decays exponentially as expected upvotes grows:

adjustedExpectedUpvotes = (1 - exp(-fatigueFactor * totalExpectedUpvotes)) / fatigueFactor 

We have used the MCMC simulation to estimate a fatigueFactor of about 0.007435115. Plugging in this estimate into the Bayesian Averaging formula, we get our final formula for estimatedUpvoteRate:

estimatedUpvoteRate ≈
    (totalUpvotes + strengthOfPrior)
    / ( (1 - exp(-fatigueFactor * totalExpectedUpvotes)) / fatigueFactor + strengthOfPrior)

Appendix: Possible Improvements

Moving Averages

Looking only at more recent data could make vote manipulation even harder: it would require a constant supply of new votes as the moving average window moves.

The moving average window would be based on expected upvotes, not time, since expected upvotes is roughly proportional to the number of people who have paid attention to a story and thus can be thought of as a proxy for sample size

news's People

Contributors

fdietze avatar joee avatar johannesnakayama avatar johnwarden avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar

news's Issues

Deal with flagged stories

Once high quality stories disappear from the official pages, they don't receive any more rank information and therefore don't accumulate more attention.

Cache-Control Header for Frontpage

We can sync these headers with our minutely page generation pattern. The crawl happens every minute on the minute, though it can take a few seconds. Roughly we can tell browsers to cache each page until, say, 10 seconds after the minute mark.

Update expectedUpvotes coefficients using causal model

expectedUpvoteShare (in deltaExpectedUpvotes.go) returns the share of upvotes historically received on average at each rank. We want instead the share of upvotes that the average story would receive if we decided to show it at that rank.

This is different for the same reason that the probability that a hospitalized person will die is not the same as the probability that you will die if you choose to visit the hospital. The effect of rank on upvote rate is confounded by the fact that highly upvoted stories are more likely to be placed at rank 1, in the same way that the effect of hospitals on death is confounded by the fact that people who are dying are more likely to be placed in a hospital.

Use moving average upvoteRate estimate

The general idea is to put more weight on more recent data when calculating upvote rate. For example, we could use the last N units of attention (expectedUpvotes). We don't want to just use the last N datapoints, because if a story is receiving little attention those datapoints provide little information about the true upvote rate. On other hand is a story is at rank 1, the number of upvotes during 10 minutes is probably a good estimate of the true upvote rate.

Remove Upvote Link

[10:33 AM, 11/16/2022] Jonathan Warden: I think we should remove the upvote button. On HN that button only appears if users are logged in. I think that showing will frustrate some users especially if they are not logged in. It also takes up space.
[10:33 AM, 11/16/2022] Felix: ok, I'm fine with that.

SEO Stuff (Not Essential)

  • Put "Hacker News Rankings" or something in the page content (under Quality News)
  • Put story title in slug of /stats pages (e.g. /stats/whats-new-in-python-311-31888624)
  • Structured Data?
  • Meta descriptions and keywords. Include "ycombinator" and "hacker news" in both. And keywords like "hacker news stats python 3.11"
  • Setup Google search console and webmaster tools

Basic SEO Stuff (Essential)

Might as well try to capture a little SEO traffic.

  • Put "Hacker News" in title (e.g. "Quality News: Hacker News Rankings")
  • Put story title and "Hacker News" in title of /stats pages (e.g. "Hacker News Story Stats: What's New in Python 3.11")
  • Breadcrumbs. E.g. https://ourdomain > hacker news story stats > what's new in Python 3.11
  • 301 redirect fly.io address to https://news.social-protocols.org/

Improve Penalty and Resubmission Time Calculation

  • The penalty field is populated with the higher of the most recently calculated penalty, and the value of the penalty field for the previous crawl. However, if the story did not appear in the previous crawl (e.g. it dropped out of the top 90), but does appear in this crawl, then it will not find the previously calculated penalty. This is because when looking for data from the previously crawl, we select on stories where:

    sampleTime = (select max(sampleTime) from dataset where sampleTime != (select max(sampleTime) from dataset))

but this value actually is different for each story. The same problem exists for the resubmission time calculation.

"Diff" or "Compare" Page

A version of our "top" page that shows:

  1. The rank that each story has on the hntop page
  2. A "change" icon like in this page: https://www.tiobe.com/tiobe-index/
  3. An "overriden" or "rejected" section with stories in the HN top page but not the QN top page.
  4. Comparative stats: average quality, average age, average attention, sitewide expected upvotes

The difference in sitewide expected upvotes would be one way to measure comparative value created. This would be the sum of (quality*expectedUpvotesAtRank) for all stories on the page. If an upvote is a proxy of value for users, this is a measure of "net value": value per unit of attention consumed.

Out Of Storage strategy

Since our database is growing indefinitely, we need a strategy to deal with limited storage.

At first, we could just delete old data from the database.
Later see how we can store the huge aggregated dataset.

Frontpage History Table

Table with aggregate stats (average age, score, weighted average quality) for each minute both the HN and QN front page.

Misc. UI Changes For Consistency with HN

  • Highlight "hntop" in white on the hntop page (the original HN does the same for all pages but the front page)
  • Show "Discuss" instead of "0 Comments" when there are no comments
  • "N minutes ago" should link to the story page
  • Username should link to user page
  • Show domain after story title
  • Mobile/smalls viewports:
    • slightly larger right margin (just a few pixels it appears to me)
    • wrap after a |
  • Do we want the same margins around the page content (grey background) as the original
  • when you hover over a relative time (e.g. 2 days ago) on HN, it shows the iso timestamp.
  • usernames for new users appear in green.

"New" Page

This would have the same ordering as the HN new page, of course.

Score Page with Plots

Make "X quality" into a link to a score page that shows stats and history of the story. Ideas:

  • Graph of rank over time
  • Graph of upvotes/attention/quality over time

Final Edits to Readme

Remove Causal Model discussion to "Further Improvements" section. Clarify description of "Hypothetical Upvotes".

Plot follow-up tasks

  • Proper error handling
  • Show "1" at the top of the rank axis
  • Horizontal dotted line at 1.0 (upvote-Rate plot)
  • Show story information at the top of the page: the same information shown on the front page: title, author, submission time, score, estimated upvote rate, HN Rank, QN Rank, etc.
  • Vertical axis of first chart should always go from 1 to 90
  • Label of last chart should be "Upvote Rate"
  • Chart titles. Include story name? Include description, eg. "HN Rank vs QN Rank"
  • Include orange page header on top of page.
  • Sans-serif font in charts?
  • Definition of Expected Upvotes and Upvote Rate and or link to docs. Maybe a * in the charts where we use those words in the title or legend, and then below the chart: "* [concise definition of expected upvotes]" or "* see the About page for the definition of expected upvotes".
  • HTML Title and H!: Include story title and maybe id. Maybe: "Story Stats for 12345: Story Title"
  • Make images fit on mobile screen.
  • LRU cache plots

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.