The national_center_for_immunization_and_respiratory_diseases_about_national_immunizations_in_children from i10brook

National Center for Immunization and Respiratory Diseases about National Immunizations in Children

Retrieve all of the data within nispuf14.dat and store it in a more accessible format. The accessible format can be any of the following: a CSV file, a JSON file, or a relational database.
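As a rough illustration of the three target formats, the sketch below writes a list of already-parsed records to CSV, JSON, and a SQLite database using only the Python standard library. The function name `export_records`, the output file names, and the table name `nispuf14` are assumptions made for this example, not part of the original notebook, and every value is stored as text.

```python
import csv
import json
import sqlite3
from pathlib import Path


def export_records(records, out_dir):
    """Write a non-empty list of dicts (one per survey record) to CSV, JSON, and SQLite.

    Hypothetical helper: file and table names are illustrative only.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    fields = list(records[0].keys())

    # CSV: one header row, then one row per record.
    with open(out_dir / "nispuf14.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        writer.writerows(records)

    # JSON: a single array of objects.
    with open(out_dir / "nispuf14.json", "w") as f:
        json.dump(records, f, indent=2)

    # Relational database: one table with a TEXT column per field.
    columns = ", ".join(f'"{name}" TEXT' for name in fields)
    placeholders = ", ".join("?" for _ in fields)
    conn = sqlite3.connect(str(out_dir / "nispuf14.db"))
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("DROP TABLE IF EXISTS nispuf14")
            conn.execute(f"CREATE TABLE nispuf14 ({columns})")
            conn.executemany(
                f"INSERT INTO nispuf14 VALUES ({placeholders})",
                [tuple(r[name] for name in fields) for r in records],
            )
    finally:
        conn.close()
```

Keeping every column as text avoids guessing at types before the codebook has been consulted; numeric conversion can happen later, once the meaning of each field is known.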

About the datasets:

We will work with two files this week: NISPUF14_CODEBOOK.PDF and nispuf14.dat (attached), from the National Center for Immunization and Respiratory Diseases about National Immunizations in Children.

We need to read the .pdf file to understand the .dat file, as outlined below.

NISPUF14_CODEBOOK.PDF is a PDF that contains a description of the format for the data in nispuf14.dat. In other words, the PDF tells you how to read the data in nispuf14.dat.

Why would we need a PDF to tell us how to read our data? Because this data file is stored in a positional (fixed-width) format: both the value and the relative position of each character carry meaning within the dataset.
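To make the idea of a positional format concrete, here is a minimal sketch of slicing one line into named fields. The field names and character offsets in `LAYOUT` are invented for illustration; the real ones have to be taken from NISPUF14_CODEBOOK.PDF.

```python
# Hypothetical layout: (field name, start offset, end offset).
# Offsets are 0-based and the end is exclusive, like Python slices.
# These values are made up; the real ones come from the codebook PDF.
LAYOUT = [
    ("record_id", 0, 5),
    ("age_group", 5, 6),
    ("state_code", 6, 8),
]


def parse_line(line, layout=LAYOUT):
    """Slice one fixed-width line into a dict of whitespace-stripped strings."""
    return {name: line[start:end].strip() for name, start, end in layout}


sample_line = "00042317"
print(parse_line(sample_line))
# {'record_id': '00042', 'age_group': '3', 'state_code': '17'}
```

Without the layout, `00042317` is just a run of digits; with it, the same characters split into three separate values. pandas offers the same behaviour through `pandas.read_fwf`, which accepts the column spans via its `colspecs` argument.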

Setup: importing the packages we will need

    import PyPDF2
    from PyPDF2 import PdfFileReader, PdfFileWriter, PdfFileMerger

    import pandas as pd
    import numpy as np
    import json

    # import the tabula library
    import tabula

Summary

In conclusion, we retrieved and cleaned data from a survey produced by the National Center for Immunization and Respiratory Diseases about National Immunizations in Children. We pulled the tables from the NISPUF14_CODEBOOK.PDF file using PyPDF2 and tabula and placed them in a clean dataframe. We then filtered that dataframe down to Section 1, the only part relevant to the .dat file, and ran pivot_table with the column names as columns to isolate just the information needed to understand nispuf14.dat.

With that insight from the .pdf file, we successfully loaded the .dat file and cleaned that dataset as well. It was a tough dataset to read, but the work done with tabula and pivot_table on the .pdf file showed that we only needed lines 0-97 for the Section 1 information. We also learned each variable, its variable name, and its line placement. We could then define the variables and, if needed, place the definitions in their own catalog or library. Finally, we retrieved all of the data in the .dat file and stored it in more accessible formats, .json and .csv (files attached).

