
UnclePhilip's Projects

air-quality-dataset icon air-quality-dataset

Data source for the 'Air Quality' dataset: http://archive.ics.uci.edu/ml/machine-learning-databases/00360/. Uses pandas, NumPy, IPython.display, and scikit-learn's SimpleImputer.
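A minimal sketch of the loading-and-imputation step, assuming the AirQualityUCI.csv file from the archive (which uses ';' separators, ',' decimal marks, and -200 as its missing-value code):

    import pandas as pd
    import numpy as np
    from sklearn.impute import SimpleImputer

    # the UCI file uses ';' separators, ',' decimals, and -200 for missing readings
    df = pd.read_csv("AirQualityUCI.csv", sep=";", decimal=",")
    df = df.dropna(axis="columns", how="all")   # drop the trailing empty columns
    df = df.replace(-200, np.nan)

    # fill missing numeric readings with the column mean
    numeric_cols = df.select_dtypes(include="number").columns
    df[numeric_cols] = SimpleImputer(strategy="mean").fit_transform(df[numeric_cols])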

binance-api icon binance-api

A Python library that implements the Binance exchange REST API and WebSocket communication

census_income icon census_income

Instructions: use the attached "Adult" data set (http://archive.ics.uci.edu/ml/datasets/Census+Income) of census data collected to predict income for the following steps. The basic idea is to use the apply() function (Chapter 9) to clean the data, and the split-apply-combine pattern (Chapter 10) to analyze it.
1. Similar to last week, replace '-' with spaces, where appropriate, using the apply() function.
2. Determine how to deal with missing values (if any) and use apply() to make the changes.
3. Use apply() with User Defined Functions (UDFs) to analyze missing values, similar to page 178 (if appropriate).
4. Use the grouping and aggregation methods in Chapter 10 to analyze data vs. income in several different ways, for example: education vs. income, job vs. income, job & education vs. income, etc. (This is not an exhaustive list; I expect you to do more.)
Remember to document your steps and reasoning using markdown cells.
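A minimal sketch of steps 1, 2, and 4, assuming a local copy of the standard adult.data file with the column names documented in adult.names:

    import pandas as pd

    # column names as documented in the UCI 'adult.names' file
    cols = ["age", "workclass", "fnlwgt", "education", "education_num",
            "marital_status", "occupation", "relationship", "race", "sex",
            "capital_gain", "capital_loss", "hours_per_week",
            "native_country", "income"]
    df = pd.read_csv("adult.data", names=cols, skipinitialspace=True)

    # step 1: replace '-' with spaces in the string columns via apply()
    str_cols = df.select_dtypes(include="object").columns
    df[str_cols] = df[str_cols].apply(lambda s: s.str.replace("-", " ", regex=False))

    # step 2: missing values are coded as '?'; surface them as NA
    df = df.replace("?", pd.NA)

    # step 4: split-apply-combine -- income distribution by education level
    print(df.groupby(["education", "income"]).size().unstack(fill_value=0))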

csv-to-json_and_json-to-csv_roundtrip_converter icon csv-to-json_and_json-to-csv_roundtrip_converter

We examined CSV and JSON file formats and wrote code to manually convert a specific CSV file to a specific JSON file in the process. We then wrote functions to do a "round-trip" (CSV->JSON->CSV or JSON->CSV->JSON) on the Consumer Complaint Database data found at https://catalog.data.gov/dataset/consumer-complaint-database#topic=consumer_navigation
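A minimal round-trip sketch using only the standard library; the file names are hypothetical local copies of the Consumer Complaint Database export:

    import csv
    import json

    def csv_to_json(csv_path, json_path):
        with open(csv_path, newline="") as f:
            rows = list(csv.DictReader(f))          # one dict per CSV row
        with open(json_path, "w") as f:
            json.dump(rows, f, indent=2)

    def json_to_csv(json_path, csv_path):
        with open(json_path) as f:
            rows = json.load(f)
        with open(csv_path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=rows[0].keys())
            writer.writeheader()
            writer.writerows(rows)

    # round-trip: CSV -> JSON -> CSV
    csv_to_json("complaints.csv", "complaints.json")
    json_to_csv("complaints.json", "complaints_roundtrip.csv")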

data_assembly_and_missing_data icon data_assembly_and_missing_data

Using data assembly to tidy data, add rows, add columns, and merge data. Using missing-data strategies to find and deal with missing values: import the needed modules, use scikit-learn's SimpleImputer, join/merge DataFrames, and concat tables based on unique identifiers. Source: https://github.com/chendaniely/pandas_for_everyone/tree/master/data
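A small sketch of the assembly-plus-imputation workflow on toy frames (all names here are made up for illustration):

    import pandas as pd
    import numpy as np
    from sklearn.impute import SimpleImputer

    left = pd.DataFrame({"id": [1, 2, 3], "score": [10.0, np.nan, 30.0]})
    right = pd.DataFrame({"id": [1, 2, 3], "group": ["a", "b", "a"]})

    more = pd.concat([left, left], ignore_index=True)   # add rows
    merged = left.merge(right, on="id", how="inner")    # join on the unique identifier

    # fill the missing score with the column mean via SimpleImputer
    merged[["score"]] = SimpleImputer(strategy="mean").fit_transform(merged[["score"]])
    print(merged)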

denver_international_airport_climate_data icon denver_international_airport_climate_data

The attached data set is climate data from Denver International Airport for the first half of February, 2019. Drop any columns you deem unnecessary. Set the date column as the index of the DataFrame. Create an "Elapsed Time" column that shows the amount of time since the first observation. Format the "Elapsed Time" column into some easily readable form; for example, after two hours, the column should NOT read 7200. Do all the things we've already been doing: format the headings, deal with missing values, etc. Perform analysis with the tools we've looked at so far. Keep in mind that the data may have to be grouped to be meaningful (average temp per day may be more useful than the average for the whole two weeks, for example). Justify your analysis choices. The deliverable is your Jupyter notebook. Just attach the notebook; don't change the file extension and don't zip it.
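A minimal sketch of the index and "Elapsed Time" steps; the file and column names below are assumptions:

    import pandas as pd

    df = pd.read_csv("denver_feb2019.csv", parse_dates=["date"]).set_index("date")

    # elapsed time since the first observation, rendered readably
    # (e.g. "0 days 02:00:00" rather than 7200 seconds)
    df["Elapsed Time"] = (df.index - df.index[0]).astype(str)

    # group before averaging: mean temperature per day, not per fortnight
    print(df["temperature"].resample("D").mean())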

docs-pages icon docs-pages

The hosted static files for the Holochain developer documentation

dow_jones_index_full_analysis icon dow_jones_index_full_analysis

The purpose of this lab is to use models to look for relationships between observed features and their outcomes. Based on the content of the dataset, it would be interesting to see whether there is any correlation between some crucial variables. At first glance this is a fairly basic dataset, but after running it through the methods we demonstrate, we look for unique observations: we aggregate information by stock, search for correlations that could support deeper analysis, and train and test linear regression models in the hope of uncovering patterns from past data that might point to future events. We primarily look at volume and its effect on the other variables.
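A minimal sketch of the correlation and train/test regression steps, assuming the UCI dow_jones_index.data file (whose price columns carry a leading '$'):

    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("dow_jones_index.data")
    df["close"] = df["close"].str.replace("$", "", regex=False).astype(float)

    # aggregate by stock, then see how volume correlates with the other numerics
    print(df.groupby("stock")["volume"].mean())
    print(df.select_dtypes(include="number").corr()["volume"])

    # train/test split and a simple linear regression of close price on volume
    X_train, X_test, y_train, y_test = train_test_split(
        df[["volume"]], df["close"], test_size=0.25, random_state=0)
    model = LinearRegression().fit(X_train, y_train)
    print("R^2 on the test set:", model.score(X_test, y_test))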

floodrisk icon floodrisk

Study and assessment of the probable impact of catastrophic flood events, and management of flood risk, with a first-order flood-fill model developed using Python geospatial libraries
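Not the repository's geospatial model, but a toy illustration of the first-order flood-fill idea on a small elevation grid: cells connected to a seed point flood if they sit below the water level.

    from collections import deque
    import numpy as np

    def flood_fill(dem, seed, water_level):
        # mark cells reachable from `seed` whose elevation is below `water_level`
        flooded = np.zeros(dem.shape, dtype=bool)
        queue = deque([seed])
        while queue:
            r, c = queue.popleft()
            if not (0 <= r < dem.shape[0] and 0 <= c < dem.shape[1]):
                continue
            if flooded[r, c] or dem[r, c] >= water_level:
                continue
            flooded[r, c] = True
            queue.extend([(r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)])
        return flooded

    dem = np.array([[3, 3, 3, 3],
                    [3, 1, 2, 3],
                    [3, 1, 1, 3],
                    [3, 3, 3, 3]], dtype=float)
    print(flood_fill(dem, seed=(1, 1), water_level=2.5))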

gapminder icon gapminder

Excerpt from the Gapminder data, as an R data package and in plain text delimited form

holo-nixpkgs icon holo-nixpkgs

Modules, packages and profiles that drive Holo, Holochain, and HoloPortOS

hypothesis_testing icon hypothesis_testing

Using RStudio, we perform a paired t-test on the means of two sample populations. By comparing the means of the datasets, with unknown variances, we test for equality of means of the two samples, to see if the sets of data are somehow related. In this exercise, we test the effectiveness of a new training method used by a new athletic trainer at a school. The scenario gives before- and after-training results for the same 10 runners under two different new coaches. We test for differences between the two coaches by comparing their runners' means before and after training.
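The project itself runs in RStudio (R's t.test() with paired = TRUE); an equivalent sketch in Python with SciPy, using made-up times for the 10 runners:

    from scipy import stats

    # hypothetical before/after times (seconds) for the same 10 runners
    before = [61.2, 59.8, 63.1, 60.5, 62.0, 58.9, 61.7, 60.1, 62.4, 59.5]
    after  = [60.1, 59.0, 62.5, 59.8, 61.2, 58.2, 61.0, 59.4, 61.8, 58.9]

    # paired t-test: H0 is that the mean before-minus-after difference is zero
    t_stat, p_value = stats.ttest_rel(before, after)
    print(t_stat, p_value)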

khronos icon khronos

A flexible python library for building your own cron-like system, with REST APIs and a Web UI.

national_center_for_immunization_and_respiratory_diseases_about_national_immunizations_in_children icon national_center_for_immunization_and_respiratory_diseases_about_national_immunizations_in_children

Retrieve all of the data within nispuf14.dat and store it in a more accessible format: a CSV file, a JSON file, or a relational database. For this assignment, feel free to use a DataFrame (Python library pandas) for intermediate steps. We will work with two files this week: NISPUF14_CODEBOOK.PDF and nispuf14.dat (attached), from the National Center for Immunization and Respiratory Diseases, about national immunizations in children. We will need to read the PDF to better understand the .dat file, as outlined below: NISPUF14_CODEBOOK.PDF describes the format of the data in nispuf14.dat; in other words, the PDF tells you how to read the data. Why would we need a PDF to tell us how to read our data? This data file is stored in a positional format, meaning both the value and the relative position of each character provide meaning within the dataset.
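A minimal sketch of reading a positional (fixed-width) file with pandas; the field spans and names below are illustrative stand-ins for what the codebook actually specifies:

    import pandas as pd

    # each (start, end) pair is the character span of one variable per the codebook
    colspecs = [(0, 6), (6, 8), (8, 12)]
    names = ["SEQNUMC", "AGEGRP", "YEAR"]

    df = pd.read_fwf("nispuf14.dat", colspecs=colspecs, names=names)
    df.to_csv("nispuf14.csv", index=False)   # store in an accessible format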

rest_api icon rest_api

A REST API using Flask that triggers workflow DAGs in Apache Airflow upon request. CouchDB lets the end-user application query the state of a request via the API, and the Airflow scripts update that status through REST calls within the Dockerized workflow.
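A minimal sketch of the Flask side, assuming Airflow 2's stable REST API and made-up service names and credentials:

    import requests
    from flask import Flask, jsonify, request

    app = Flask(__name__)
    AIRFLOW = "http://airflow-webserver:8080/api/v1"   # assumed container hostname

    @app.route("/runs/<dag_id>", methods=["POST"])
    def trigger(dag_id):
        # POST /dags/{dag_id}/dagRuns starts a DAG run in Airflow 2
        resp = requests.post(
            f"{AIRFLOW}/dags/{dag_id}/dagRuns",
            json={"conf": request.get_json(silent=True) or {}},
            auth=("airflow", "airflow"),               # assumed credentials
        )
        return jsonify(resp.json()), resp.status_code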

uci_ml_archive icon uci_ml_archive

From the Bank Marketing data set in the UCI ML Archive (http://archive.ics.uci.edu/ml/datasets/Bank+Marketing). The data set has 20 feature columns plus one result column, and we need to do some work to get it ready for further processing.
1. Reference the bank-additional-names.txt file for column types and what the names mean.
2. Make the following changes: change column names to remove abbreviations, capitalize, add spaces, and generally make the names more "meaningful" to casual readers; change column types to match the associated feature types; replace word separators in strings, like "-" or ".", with spaces.
3. Missing attribute values: several categorical attributes have missing values, all coded with the "unknown" label. These can be treated as a possible class label or handled with deletion or imputation techniques.
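A minimal sketch of step 2 and the "unknown" handling, assuming the ';'-separated bank-additional-full.csv from the archive (the renamed columns are examples, not the full mapping):

    import pandas as pd

    df = pd.read_csv("bank-additional-full.csv", sep=";")

    # make names meaningful, e.g. 'emp.var.rate' -> 'Employment Variation Rate'
    df = df.rename(columns={"emp.var.rate": "Employment Variation Rate",
                            "nr.employed": "Number Employed"})

    # replace word separators like '-' or '.' in string values with spaces
    for col in df.select_dtypes(include="object").columns:
        df[col] = (df[col].str.replace("-", " ", regex=False)
                          .str.replace(".", " ", regex=False))

    # 'unknown' codes a missing value; surface it for deletion or imputation
    df = df.replace("unknown", pd.NA)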

wholesale_customer_data icon wholesale_customer_data

The dataset used, 'Wholesale customers data' from the UCI Machine Learning Repository, is from this source: http://archive.ics.uci.edu/ml/datasets/Wholesale+customers. This script creates DataFrames and filters, aggregates, slices, groups, and compares variables using pandas functions.
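A minimal sketch of the filter and groupby/aggregate steps, assuming the column names in the UCI file:

    import pandas as pd

    df = pd.read_csv("Wholesale customers data.csv")
    spend = ["Fresh", "Milk", "Grocery", "Frozen", "Detergents_Paper", "Delicassen"]

    high_fresh = df[df["Fresh"] > df["Fresh"].median()]        # filter
    print(df.groupby(["Channel", "Region"])[spend].mean())     # group + aggregate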

worldbank_gdp icon worldbank_gdp

We use the "government expenditure on education" dataset from the intro, found here: https://databank.worldbank.org/source/education-statistics-%5e-all-indicators, and look at GDP % spending by country, following the instructions verbatim to avoid confusion. We delete useless columns, analyze the data, and evaluate how reshaping the dataset makes analysis easier; we discuss how we tidy and reshape the data throughout. Uses os, requests, pandas, and NumPy.
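A minimal reshape sketch; the file and column names are assumptions based on a typical DataBank export:

    import pandas as pd

    df = pd.read_csv("worldbank_education_gdp.csv")   # hypothetical local export

    # drop unneeded columns, then melt the wide year columns into tidy long form
    df = df.drop(columns=["Series Code", "Country Code"], errors="ignore")
    tidy = df.melt(id_vars=["Country Name", "Series Name"],
                   var_name="Year", value_name="Expenditure (% of GDP)")
    print(tidy.head())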
