Retrieve all of the data within nispuf14.dat and store it in a more accessible format. An accessible format can be any of the following: a CSV file, a JSON file, or a relational database.
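As a rough sketch of what writing to each of these three target formats can look like, the snippet below saves a couple of made-up records as CSV, JSON, and a SQLite table using only the Python standard library. The field names (CHILD_ID, AGE_GROUP, SEX), the table name, the file names, and the values are all hypothetical placeholders, not real NISPUF14 variables or data.

```python
import csv
import json
import sqlite3
import tempfile
from pathlib import Path

# Hypothetical records standing in for rows parsed from nispuf14.dat.
records = [
    {"CHILD_ID": "00001", "AGE_GROUP": "1", "SEX": "2"},
    {"CHILD_ID": "00002", "AGE_GROUP": "2", "SEX": "1"},
]
fields = list(records[0])

# Write everything into a throwaway directory so the sketch has no side effects.
out_dir = Path(tempfile.mkdtemp())

# 1) CSV file: one header row, then one row per record.
csv_path = out_dir / "sample.csv"
with open(csv_path, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fields)
    writer.writeheader()
    writer.writerows(records)

# 2) JSON file: a list of objects, one per record.
json_path = out_dir / "sample.json"
with open(json_path, "w") as f:
    json.dump(records, f, indent=2)

# 3) Relational database: a single SQLite table (hypothetical name "children").
db_path = out_dir / "sample.db"
conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE children (CHILD_ID TEXT, AGE_GROUP TEXT, SEX TEXT)")
conn.executemany(
    "INSERT INTO children VALUES (:CHILD_ID, :AGE_GROUP, :SEX)", records
)
conn.commit()
row_count = conn.execute("SELECT COUNT(*) FROM children").fetchone()[0]
conn.close()
```

All values are kept as text here so that leading zeros in coded fields survive the round trip; whether that is the right choice for the real dataset depends on the codebook.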
We will work with two files this week, NISPUF14_CODEBOOK.PDF and nispuf14.dat (attached), both from the National Center for Immunization and Respiratory Diseases and covering national immunizations in children.
We will need to read the .pdf file to be able to better understand the .dat file, as we will outline below.
NISPUF14_CODEBOOK.PDF is a PDF that contains a description of the format for the data in nispuf14.dat. In other words, the PDF tells you how to read the data in nispuf14.dat.
Why would we need a PDF to tell us how to read our data? Well, this data file is stored in a positional format. This means that both the value and the relative position of each character provide meaning within the dataset.
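To make the idea of a positional format concrete, here is a minimal sketch of how a layout taken from a codebook can be used to slice each line into named fields. The layout, the field names, and the sample lines are invented for illustration; the real column positions must come from NISPUF14_CODEBOOK.PDF.

```python
# Hypothetical layout: (field name, start, end), 0-based and end-exclusive.
# These are NOT the real NISPUF14 positions; the codebook defines those.
LAYOUT = [
    ("CHILD_ID", 0, 5),
    ("AGE_GROUP", 5, 6),
    ("SEX", 6, 7),
]

# Two made-up lines in the same spirit as a positional .dat file:
# no delimiters, every field identified only by where it sits in the line.
SAMPLE = "0000112\n0000221\n"


def parse_line(line, layout):
    """Slice one fixed-width line into a dict of field name -> text value."""
    return {name: line[start:end].strip() for name, start, end in layout}


records = [parse_line(line, LAYOUT) for line in SAMPLE.splitlines()]
```

With this layout, the first sample line becomes CHILD_ID "00001", AGE_GROUP "1", and SEX "2". The same characters would mean something entirely different under another layout, which is why the codebook is required. For a full file, pandas.read_fwf accepts the same kind of (start, end) pairs through its colspecs argument and may be more convenient than slicing by hand.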
import PyPDF2
from PyPDF2 import PdfFileReader, PdfFileWriter, PdfFileMerger
import pandas as pd
import numpy as np
import json
# import tabula library
import tabula
In conclusion, we looked at retrieving and cleansing data from a survey generated by the National Center for Immunization and Respiratory Diseases about national immunizations in children. We successfully pulled the data from the NISPUF14_CODEBOOK.PDF file using PyPDF2 and tabula and placed it into a clean dataframe in table form. We cleaned the data to keep only Section 1, which was relevant to the .dat file. Once the dataframe was clean, we ran pivot_table on the column names to isolate only the information needed to understand the nispuf14.dat file. After gaining insight into the data from the .pdf file, we successfully loaded the .dat file and did some cleaning on that dataset as well. It was a tough dataset to read, but with the work done with tabula and pivot_table on the .pdf file, we were able to understand that we only needed lines 0-97 for the Section 1 information. We also learned each variable, its variable name, and its line placement. We could then define the variables and place them in their own catalog, to be defined in their own library if needed. Finally, we successfully retrieved all of the data in the .dat file and stored it in more accessible formats, .json and .csv (files attached).