
usbroadbandusagepercentages's Introduction

United States Broadband Usage Percentages Dataset

John Kahan - Vice President, Chief Data Analytics Officer | Juan Lavista Ferres - Chief Scientist, Microsoft AI for Good Research Lab

Please contact [email protected] for questions.

Microsoft Introduction and Purpose

We are publishing datasets we developed as part of our efforts with Microsoft’s Airband Initiative to help close the rural broadband gap. The data can be used for the purpose of analyzing, understanding, improving, or addressing problems related to broadband access.

The datasets consist of data derived from anonymized data Microsoft collects as part of our ongoing work to improve the performance and security of our software and services. The data does not include any personally identifiable information (PII), including IP addresses. We also suppress any location with fewer than 20 devices. Other than the aggregated data shared in this data table, no other data is stored during this process.

We estimate broadband usage by combining data from multiple Microsoft services. The data from these services is combined with the number of households per county and zip code[1]. Every time a device receives an update or connects to a Microsoft service, we can estimate the throughput speed of the machine: we know the size of the package sent to the computer, and we know the total time of the download. We also determine zip code-level location data via reverse IP. Therefore, we can count the number of devices in each zip code that have connected to the internet at broadband speed, based on the FCC’s definition of broadband as a download speed of 25 Mbps[2]. Using this method, we estimate that ~120.4 million people in the United States are not using the internet at broadband speeds.
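The throughput estimate described above can be sketched in a few lines. This is an illustration only, not the production pipeline: the function names, the input layout (one observation per device), and the use of a simple share of observed devices are assumptions made for the example. The actual method also combines household counts in a way this README does not spell out.

```python
from typing import Dict, Iterable, Tuple

BROADBAND_THRESHOLD_MBPS = 25.0  # FCC download threshold cited in the text
MIN_DEVICES = 20                 # locations with fewer devices are suppressed


def throughput_mbps(package_bytes: int, download_seconds: float) -> float:
    """Estimate throughput in megabits per second from package size and download time."""
    if download_seconds <= 0:
        raise ValueError("download_seconds must be positive")
    return (package_bytes * 8) / download_seconds / 1_000_000


def broadband_share_by_zip(
    observations: Iterable[Tuple[str, int, float]],
    min_devices: int = MIN_DEVICES,
) -> Dict[str, float]:
    """Share of observed devices at or above the threshold, per zip code.

    Each observation is (zip_code, package_bytes, download_seconds) and is
    assumed (hypothetically) to represent one device.
    """
    totals: Dict[str, int] = {}
    fast: Dict[str, int] = {}
    for zip_code, package_bytes, seconds in observations:
        totals[zip_code] = totals.get(zip_code, 0) + 1
        if throughput_mbps(package_bytes, seconds) >= BROADBAND_THRESHOLD_MBPS:
            fast[zip_code] = fast.get(zip_code, 0) + 1
    # Drop zip codes below the suppression threshold.
    return {z: fast.get(z, 0) / n for z, n in totals.items() if n >= min_devices}
```

For example, a 100 MB package that takes 20 seconds to download corresponds to 40 Mbps, which counts as broadband speed under the 25 Mbps definition; the same package taking 40 seconds corresponds to 20 Mbps, which does not.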

Background

Every day our world becomes a little more digital. But reaping the benefits of this digital world – pursuing new educational opportunities through distance learning, feeding the world through precision agriculture, growing a small business by leveraging the cloud, and accessing better healthcare through telemedicine – is only possible for those with a broadband connection, which is especially apparent now as more people are staying home due to the COVID-19 pandemic. Based on the Fourteenth Broadband Deployment Report from the Federal Communications Commission (FCC)[4], broadband is not available to at least 14.5 million people, 11.3 million of whom live in this country’s rural areas. Getting these numbers right is vitally important. This data is used by federal, state, and local agencies to decide where to target public funds dedicated to closing this broadband gap. That means millions of Americans already lacking access to broadband have been made invisible, substantially decreasing the likelihood of additional broadband funding or much-needed broadband service. We are publishing this data today to allow others to use it to develop solutions for improving broadband access or addressing problems with broadband access.

broadbandmap.png Figure 1: Map of the United States by county with indicators of broadband availability and broadband usage

Broadband Usage Percentages Zip Code Dataset: A Differentially Private Data Release

The initial dataset, released in April 2020, provided broadband usage percentages at the US county level. In December 2020, we are adding a zip code-level view of this same information. The data is to be used for the purpose of analyzing, understanding, improving, or addressing problems related to broadband access. As mentioned in the April 2020 release, the Broadband Usage Percentages Dataset is derived from aggregated and anonymized data Microsoft collects as part of our ongoing work to improve software and service performance and security.

Because the zip code-level dataset provides a more granular view of broadband usage percentages by household, we took the additional step of strengthening data privacy guarantees by using differential privacy, a technique that adds noise to the data aggregations, preventing leakage about the presence of specific individuals in the dataset. We implemented differential privacy through the SmartNoise platform, a first-of-its-kind open source platform for differential privacy co-developed by Microsoft and the OpenDP initiative led by Harvard. We estimate broadband usage by combining privatized data from multiple Microsoft services. Because differential privacy adds noise to protect privacy, the noise added to zip codes with a small number of households can impact utility. To ensure transparency into how zip codes with different population magnitudes are affected, we have included error range data. To read more about how differential privacy has been applied to this data, read the Broadband usage differential privacy paper.
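To see why noise affects small zip codes more than large ones, here is a minimal sketch using the textbook Laplace mechanism in plain Python. It does not use the SmartNoise API and is not the mechanism documented in the paper; the epsilon value, the sensitivity of 1, and the function names are assumptions chosen for illustration.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_count(true_count: float, epsilon: float, rng: random.Random,
                  sensitivity: float = 1.0) -> float:
    """Add Laplace noise with scale sensitivity/epsilon to a count."""
    return true_count + laplace_noise(sensitivity / epsilon, rng)


def error_summary(true_count: int, households: int, epsilon: float,
                  trials: int, rng: random.Random):
    """Return (MAE, MSD) of the noisy usage fraction over repeated trials."""
    errors = []
    for _ in range(trials):
        noisy = private_count(true_count, epsilon, rng)
        errors.append((noisy - true_count) / households)
    mae = sum(abs(e) for e in errors) / trials  # mean absolute error
    msd = sum(errors) / trials                  # mean signed deviation (bias)
    return mae, msd


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical zip codes: same true usage (50%), different household counts.
    print("small zip:", error_summary(25, 50, 1.0, 2000, rng))
    print("large zip:", error_summary(2500, 5000, 1.0, 2000, rng))
```

The noise added to the count has the same magnitude in both cases, so dividing by a smaller number of households yields a larger error in the usage fraction. This is the effect the error range columns are meant to make visible.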

broadbandusagezipcode.png Figure 2: Map of the United States by zip codes with indicators of broadband usage

Data table

Data contained in the data table includes counties in the United States

Data contained in the zip code data table includes the following fields:

  • ST: 2-letter abbreviation of the state in the United States https://www.iso.org/obp/ui/#iso:code:3166:US
  • COUNTY ID: 4 to 5 digit code used to represent the county (last 3 digits) and the state (first digit or first 2 digits) https://www.census.gov/geographies/reference-files.html
  • ZIP CODE: 5 digit code used to represent a geographic area, as used by the United States Postal Service
  • BROADBAND USAGE: percent of people per county that use the internet at broadband speeds based on the methodology explained above. Data is from October 2020.
  • ERROR RANGE (MAE)(+/-) : mean absolute error (MAE). The non-private broadband coverage estimate will be, on average, within this error range.
  • ERROR RANGE (95%)(+/-) : 95th percentile error range. 95% of the time, the non-private broadband coverage estimate for zip codes with a similar number of households will be within this error range.
  • MSD: We also provide the mean signed deviation (MSD). The mean signed deviation offers an estimate of bias introduced by the process.
  • For a detailed explanation of the differential privacy methodology, please refer to https://arxiv.org/pdf/2103.14035.pdf
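As a rough sketch of how these fields might be read, the snippet below parses two made-up rows and derives an interval from the 95th percentile error range. The column headers are assumed to match the field names listed above, the sample values are hypothetical, and BROADBAND USAGE is assumed to be stored as a fraction between 0 and 1; check the actual CSV before relying on any of this.

```python
import csv
import io

# Hypothetical rows; headers assumed to match the field list above.
SAMPLE = """ST,COUNTY ID,ZIP CODE,BROADBAND USAGE,ERROR RANGE (MAE)(+/-),ERROR RANGE (95%)(+/-),MSD
MN,27153,56318,0.40,0.02,0.05,0.001
AK,2020,99501,0.70,0.01,0.03,-0.002
"""


def split_county_id(county_id: str):
    """Split a 4 to 5 digit COUNTY ID into (state code, county code)."""
    padded = county_id.zfill(5)  # restore a leading zero dropped from the state code
    return padded[:2], padded[2:]


def usage_interval(row, error_column="ERROR RANGE (95%)(+/-)"):
    """Return (low, high) bounds for BROADBAND USAGE, clipped to [0, 1]."""
    usage = float(row["BROADBAND USAGE"])
    err = float(row[error_column])
    return max(0.0, usage - err), min(1.0, usage + err)


rows = list(csv.DictReader(io.StringIO(SAMPLE)))

if __name__ == "__main__":
    for row in rows:
        state, county = split_county_id(row["COUNTY ID"])
        low, high = usage_interval(row)
        print(row["ZIP CODE"], state, county, round(low, 3), round(high, 3))
```

Padding COUNTY ID to five digits before splitting handles states whose code starts with a zero, which is why the field is described as 4 to 5 digits.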

Suggested datasets that can be used in combination with the county level data:

Here are links to additional broadband information:

  1. American Fact Finder https://data.census.gov/ and Office of Policy Development and Research https://www.huduser.gov/portal/datasets/usps_crosswalk.html
  2. “2018 Broadband Deployment Report | Federal Communications Commission.” https://www.fcc.gov/reports-research/reports/broadband-progress-reports/2018-broadband-deployment-report (accessed Apr. 15, 2020).
  3. “2019 Broadband Deployment Report,” Federal Communications Commission, Jun. 11, 2019. https://www.fcc.gov/reports-research/reports/broadband-progress-reports/2019-broadband-deployment-report (accessed Apr. 15, 2020).
  4. “Fourteenth Broadband Deployment Report,” Federal Communications Commission, Jan. 19, 2021. https://www.fcc.gov/reports-research/reports/broadband-progress-reports/fourteenth-broadband-deployment-report

Here are links to differential privacy information:

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

The data in the broadband_data.csv file is licensed under the Open Use of Data Agreement v1.0. Data sources include Microsoft’s Rural Broadband initiative and the Federal Communications Commission’s 2019 Broadband Deployment Report.

usbroadbandusagepercentages's People

Contributors

allen10101-zz, kasavin, lucas-a-meyer, microsoft-github-operations[bot], microsoftopensource, prbatero, v-javdel


usbroadbandusagepercentages's Issues

Zip Code - County

The zip code to county matching in broadband_data_zipcode.csv seems a bit erroneous. There are duplicates of zip code by county which initially seems ok because zip codes don't follow county boundaries. But, for example, 56318 shows to be in 27153 (Todd County) and 27111 (also Todd County but should be Otter Tail County) but in reality it is in 27153 (Todd County) and 27097 (Morrison County). Perhaps your lookup is coming from a kind of geoip location but sometimes the county seems to be wrong for a zip code. Perhaps you meant this to be a distinct list of postal codes and the duplicates are unintended. They only exist in AK, MN, and SD.
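A check along these lines would surface the zip codes that appear under more than one county. It is only a sketch: the rows are passed in as (state, county ID, zip code) tuples, and the two 56318 entries are the ones described above, while the third row is an illustrative single-county zip code added for contrast.

```python
from collections import defaultdict


def zips_with_multiple_counties(rows):
    """Map each zip code listed under more than one county to its sorted county IDs.

    rows: iterable of (state, county_id, zip_code) tuples.
    """
    counties = defaultdict(set)
    for _state, county_id, zip_code in rows:
        counties[zip_code].add(county_id)
    return {z: sorted(ids) for z, ids in counties.items() if len(ids) > 1}


if __name__ == "__main__":
    sample = [
        ("MN", "27153", "56318"),
        ("MN", "27111", "56318"),
        ("MN", "27053", "55401"),  # illustrative row with a single county
    ]
    print(zips_with_multiple_counties(sample))
```

The resulting list can then be compared against a zip-to-county crosswalk to see which pairings are genuine boundary overlaps and which are mismatches.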

Document Constraints & Influences

The paper suggests that the test is able to estimate aggregate connection speeds but is silent on factors influencing and constraining the measurements. For example, does the test sense whether other users are concurrently using the connection and thereby reducing available per-user or per-device capacity (and not record a result in the presence of competing traffic)? Does the test record whether the device was on WiFi or Ethernet, and can the dataset be separated based on connection type? For WiFi connections, is the 802.11 protocol, WiFi SNR, or number of other clients on the WiFi network recorded in the test, potentially making it possible to filter out older/slow WiFi standards, devices at the edge of coverage, etc.? What about device type and device capabilities (e.g. do you record if it is a 15-year-old laptop that cannot exceed 10 Mbps)?

Reference: https://cacm.acm.org/magazines/2020/12/248801-measuring-internet-speed

impact of Delivery Optimization feature on these speed inferences

Microsoft's Delivery Optimization feature allows users to "Limit how much bandwidth is used for downloading updates". Users can limit "Absolute bandwidth" used for downloads or they can limit the "Percentage of measured bandwidth" used. Is Microsoft able to discern which measurements included within its data pool were constrained by user self-imposed Delivery Optimization throttling, which is facilitated by Microsoft itself, and is that erroneous and misleading data removed from the data pool?

https://docs.microsoft.com/en-us/windows/deployment/update/waas-delivery-optimization

test servers

Does Microsoft have sufficient server capacity to deliver content that saturates every single access line of the connections that it is serving software updates to?
Are Microsoft's servers geographically distributed to optimize update distribution?
Is routing optimized for end-user experience and performance or to minimize Microsoft's costs (since this is a cost center)?
Does Microsoft have sufficient interconnection capacity to support surges in software update demands?

Number of TCP Connections

The paper does not mention the number of TCP connections used in conducting the tests. It is well-known that using a single connection will tend to underestimate aggregate connection speed so it seems to be important to share more details on how the underlying test works.
