
importing-and-cleaning-data's Introduction

Course-1: Introduction to Importing Data in Python

Reading a Text File using Python

filename = 'huck_finn.txt'

# Open the file in 'r' mode
file = open(filename, mode='r')  # 'r' is for reading

# Read the content of the file
text = file.read()

# Close the file
file.close()

Another way: reading a text file using a with statement

filename = 'huck_finn.txt'

# Open the file in 'r' mode using a 'with' statement
with open(filename, mode='r') as file:
    text = file.read()

# The file is automatically closed when the 'with' block is exited
  1. with statement:

    • The with statement in Python is used to ensure that a block of code is executed with a particular context, and when the block is exited, a cleanup or resource release is performed.
    • In the case of working with files, the with statement is commonly used to open a file. It automatically takes care of closing the file when you are done with it.
  2. Opening a file using with statement:

    • with open(filename, mode='r') as file:: This line opens the file specified by the filename in read mode ('r'). The file is assigned to the variable file.
    • The with statement ensures that the file is properly closed when the code block inside it is exited.
  3. Reading the file:

    • text = file.read(): This line reads the entire content of the file and stores it in the variable text. The read() method is used for this purpose.
  4. Automatic closing of the file:

    • Once the code block indented under the with statement has finished executing, the file is automatically closed. You don't need to explicitly call file.close().

By using the with statement, you ensure that the file is closed properly, and it simplifies the code compared to manually managing the opening and closing of the file.

Flat File:

Flat files refer to files that contain records with no structured relationships between the records and no structure for indexing, unlike a relational database. These files typically contain plain text and are used to store tabular data.

There are two common types of flat files:

  1. Text Files:

    • CSV (Comma-Separated Values): In CSV files, each record is a separate line, and fields within the record are separated by commas. It's a widely used format for storing tabular data. Example:
    Name, Age, Occupation
    John, 25, Engineer
    Jane, 30, Scientist
    
    • TSV (Tab-Separated Values): Similar to CSV, but fields are separated by tabs. Example:
    Name    Age    Occupation
    John    25     Engineer
    Jane    30     Scientist
    
    • Fixed-width Files: In these files, each field has a fixed width, and data is aligned accordingly. Example:
    John      25   Engineer
    Jane      30   Scientist
    
  2. Binary Files:

    • Binary flat files store data in a format that is not human-readable. They are more efficient for storing and retrieving data quickly, but they lack human readability. Examples: Database files, images, audio files, etc.
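
As a quick illustration (a minimal sketch; 'image.png' is a hypothetical file name), binary files are opened in 'rb' mode and read as raw bytes rather than text:

# Open a binary file in 'rb' mode and read its raw bytes
with open('image.png', 'rb') as file:
    header = file.read(8)  # read just the first 8 bytes

print(header)  # e.g. a format signature such as b'\x89PNG\r\n\x1a\n', not human-readable text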

Reading and Writing Flat Files in Python:

  • Reading CSV File:

import csv

filename = 'data.csv'

with open(filename, mode='r') as file:
    reader = csv.reader(file)
    for row in reader:
        print(row)
  • Writing CSV File:

import csv

filename = 'output.csv'

data = [
    ['Name', 'Age', 'Occupation'],
    ['John', 25, 'Engineer'],
    ['Jane', 30, 'Scientist']
]

with open(filename, mode='w', newline='') as file:
    writer = csv.writer(file)
    writer.writerows(data)
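
The same csv module also handles the other text formats listed above; for example, a tab-separated file can be read by changing the delimiter (a minimal sketch, assuming a 'data.tsv' file):

import csv

filename = 'data.tsv'

# Reuse csv.reader for TSV files by setting the delimiter to a tab
with open(filename, mode='r', newline='') as file:
    reader = csv.reader(file, delimiter='\t')
    for row in reader:
        print(row)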

Understanding the Power of NumPy for Data Import

1. Why NumPy?

NumPy, short for Numerical Python, is a powerful library in the Python ecosystem that provides support for large, multi-dimensional arrays and matrices, along with a variety of high-level mathematical functions to operate on these arrays. It is an essential tool for data manipulation and analysis, particularly in the fields of machine learning, data science, and scientific computing.

Key features of NumPy include:

  • Efficient and fast array operations
  • Broadcasting capabilities for element-wise operations
  • Tools for integrating C/C++ and Fortran code
  • Linear algebra and mathematical functions
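
As a quick illustration of broadcasting, an operation with a scalar is applied element-wise across a whole array without an explicit loop (a minimal sketch):

import numpy as np

heights_cm = np.array([170, 182, 165])

# Broadcasting: the division by 100 is applied to every element at once
heights_m = heights_cm / 100
print(heights_m)  # [1.7  1.82 1.65]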

2. Importing Flat Files using NumPy

When working with data stored in flat files (such as CSV files), NumPy provides convenient functions to import the data into arrays. One of the commonly used functions is numpy.loadtxt(). Let's take a look at a basic example:

import numpy as np

# Specify the file path
filename = 'data.csv'

# Load data from the CSV file using NumPy
data = np.loadtxt(filename, delimiter=',')

# Print the loaded data
print(data)

This code snippet demonstrates how to import data from a CSV file using NumPy. The `delimiter=','` parameter specifies that the values in the file are separated by commas. Note that np.loadtxt() expects purely numeric data unless you pass a different dtype.

3. Customizing Your NumPy Import

NumPy provides additional parameters to customize the import process based on the specifics of your data. Let's explore some customization options:

import numpy as np

# Specify the file path
filename = 'MNIST_header.txt'

# Load data from the file using NumPy
# Skip the first row (header) and select only columns 0 and 2
data = np.loadtxt(filename, delimiter=',', skiprows=1, usecols=[0, 2], dtype=str)

# Print the loaded data
print(data)

In this example:

  • skiprows=1 skips the first row (header) of the file.
  • usecols=[0, 2] specifies that only columns 0 and 2 will be loaded into the NumPy array.
  • dtype=str ensures that the data is loaded as strings, which can be crucial when dealing with mixed data types in a column.

Customizing your NumPy import allows you to tailor the loading process to your specific needs, making it a versatile tool for handling diverse datasets.
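
For files that mix data types across columns, np.genfromtxt() is often more convenient than np.loadtxt(), because dtype=None lets NumPy infer a type per column (a minimal sketch; 'titanic.csv' and its 'Survived' column are assumed file and column names):

import numpy as np

# names=True takes field names from the header row;
# dtype=None lets NumPy infer a separate type for each column
data = np.genfromtxt('titanic.csv', delimiter=',', names=True, dtype=None, encoding='utf-8')

# Columns can then be accessed by field name
print(data['Survived'])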

In conclusion, NumPy plays a crucial role in efficiently importing and manipulating data, providing a solid foundation for various data-related tasks in the Python programming language.

Import file using np.recfromcsv()

The np.recfromcsv() function in NumPy is specifically designed for structured or record arrays, where fields have names and data types. It is useful when dealing with CSV files containing heterogeneous data types and named columns. Here's an example of how you can use np.recfromcsv() to import data from a CSV file:

import numpy as np

# Specify the file path
filename = 'data.csv'

# Load data from the CSV file using np.recfromcsv()
data = np.recfromcsv(filename, delimiter=',')

# Print the loaded data
print(data)

In this example:

  • np.recfromcsv() reads the CSV file and automatically interprets the header to create a structured array with named fields.
  • The delimiter=',' parameter specifies that the values in the file are separated by commas.

If the CSV file has a header row, the function will use the header to name the fields; by default the field names are converted to lowercase. You can access the data using field names like you would with a dictionary:

# Accessing data by field name (note the lowercased names)
print(data['name'])
print(data['age'])
print(data['occupation'])

This approach is particularly helpful when dealing with CSV files that contain different data types in different columns and when you want to work with named fields rather than numerical indices.


Importing flat files using pandas

Pandas is a popular data manipulation library in Python, and it provides a convenient way to import and manipulate flat files, such as CSV files. The pandas.read_csv() function is commonly used for this purpose. Here's an example of how to import a flat file using pandas:

import pandas as pd

# Specify the file path
filename = 'data.csv'

# Read the CSV file into a DataFrame
df = pd.read_csv(filename)

# Display the DataFrame
print(df)

# View the head of the DataFrame
print(df.head())

# Read only the first 5 rows of the file into a DataFrame, with no header row: data
data = pd.read_csv(filename, nrows=5, header=None)

# Build a NumPy array from the DataFrame: data_array
data_array = data.values

In this example, pd.read_csv() reads the CSV file and creates a DataFrame, which is a two-dimensional labeled data structure with columns that can be of different data types. The resulting DataFrame (df) can be easily manipulated using various pandas functions.

If your flat file has a different delimiter or uses a different encoding, you can specify these options using additional parameters. For example:

# Specify a different delimiter (e.g., tab-separated values)
df = pd.read_csv('data.tsv', delimiter='\t')

# Specify a different encoding (e.g., UTF-8)
df = pd.read_csv('data.csv', encoding='utf-8')

Pandas also provides functions for reading Excel files (pd.read_excel()), JSON files (pd.read_json()), and many other data formats.
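
These readers work the same way as pd.read_csv(), returning a DataFrame (a minimal sketch, assuming a 'data.json' file of records):

import pandas as pd

# Read a JSON file of records straight into a DataFrame
df_json = pd.read_json('data.json')
print(df_json.head())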

Remember to install pandas before running the code if you haven't already:

pip install pandas

Pandas simplifies the process of working with tabular data and is widely used in data analysis and manipulation tasks.

Example:

# Import pandas library
import pandas as pd

# Import matplotlib.pyplot as plt
import matplotlib.pyplot as plt

# Assign filename: file
file = 'titanic_corrupt.txt'

# Import file: data
# Use sep='\t' for tab-separated values, comment='#' for lines starting with '#', and na_values='Nothing' for recognizing 'Nothing' as NA/NaN
data = pd.read_csv(file, sep='\t', comment='#', na_values='Nothing')

# Print the head of the DataFrame
print(data.head())

# Plot 'Age' variable in a histogram
data[['Age']].hist()
plt.xlabel('Age (years)')
plt.ylabel('count')
plt.show()

In this code:

  • sep='\t' specifies that the values in the file are tab-separated.
  • comment='#' indicates that lines starting with '#' should be treated as comments and ignored.
  • na_values='Nothing' specifies that the string 'Nothing' should be recognized as NA/NaN values.

This code reads the data from 'titanic_corrupt.txt' into a DataFrame, prints the first few rows of the DataFrame using head(), and then plots a histogram of the 'Age' variable using matplotlib.pyplot.

Let's explore some additional aspects and functionalities related to importing flat files using Pandas:

Handling Missing Values:

Sometimes, your flat file might contain missing or undefined values. Pandas provides options to handle such cases:

# Specify custom missing values during import
missing_values = ['NA', 'None', '-']
df = pd.read_csv('data_with_missing.csv', na_values=missing_values)
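
Once imported, the missing entries appear as NaN and can be inspected or handled directly on the DataFrame (a minimal sketch, continuing from the df above):

# Count missing values per column
print(df.isna().sum())

# Drop rows containing missing values, or fill them with a default
df_clean = df.dropna()
df_filled = df.fillna(0)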

Skipping Rows:

You can skip a specific number of rows at the beginning of the file using the skiprows parameter:

# Skip the first two rows
df = pd.read_csv('data.csv', skiprows=2)

Specifying Column Names:

If your file doesn't have header information, or you want to provide custom column names, you can use the names parameter:

# Specify custom column names
column_names = ['ID', 'Name', 'Age', 'Salary']
df = pd.read_csv('data_no_header.csv', names=column_names)

Reading Specific Columns:

You can read only specific columns from the file by passing the usecols parameter:

# Read only 'Name' and 'Salary' columns
df = pd.read_csv('data.csv', usecols=['Name', 'Salary'])

Skipping Footer:

If your file has metadata or footer information that you want to skip, you can use the skipfooter parameter:

# Skip the last three rows
df = pd.read_csv('data_with_footer.csv', skipfooter=3, engine='python')

Reading Excel Files:

Pandas can also read Excel files using pd.read_excel():

# Read Excel file
df_excel = pd.read_excel('data.xlsx', sheet_name='Sheet1')

Handling Date Formats:

If your flat file contains date columns, you can specify the date format for proper parsing:

# Parse the 'Date' column as datetime; date_format (pandas >= 2.0) replaces the deprecated date_parser argument
df = pd.read_csv('data_with_dates.csv', parse_dates=['Date'], date_format='%Y-%m-%d')

These additional features showcase the flexibility of Pandas in handling various scenarios while importing flat files. Depending on your specific requirements, you can customize the import process accordingly.

Introduction to Other File Types

Other file types

  • Pickled files
  • Excel spreadsheets
  • MATLAB files
  • SAS files
  • Stata files
  • HDF5 files

Pickled Files in Python

What are Pickled Files?

Pickled files are a type of serialized data format native to Python. Serialization refers to the process of converting a Python object into a byte stream. Pickling is the term used for this process in Python.

Motivation for Pickled Files:

Pickling becomes particularly useful when dealing with complex data structures or objects that don't have an obvious and straightforward way to be stored in a text-based format like CSV or JSON. It allows you to serialize and store Python objects for later use.

Introduction to Importing Pickled Files:

To import pickled files in Python, you can use the pickle module. Here's a simple example:

import pickle

# Open the pickled file for reading in binary mode ('rb')
with open('pickled_fruit.pkl', 'rb') as file:
    # Load the pickled data
    data = pickle.load(file)

# Print the loaded data
print(data)

In this example, the open function is used to open the pickled file in binary mode ('rb'), and pickle.load() is used to deserialize the data.

Example Pickled File Content:

Assuming the pickled file contains information about fruit quantities, the loaded data might look like this:

{'peaches': 13, 'apples': 4, 'oranges': 11}

This represents a Python dictionary with fruit names as keys and corresponding quantities as values.
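
Such a file could have been created by pickling the dictionary in the first place (a minimal sketch):

import pickle

fruit = {'peaches': 13, 'apples': 4, 'oranges': 11}

# Serialize the dictionary to a file opened in binary write mode ('wb')
with open('pickled_fruit.pkl', 'wb') as file:
    pickle.dump(fruit, file)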


In addition to the commonly used flat files and pickled files, there are various other file types that you might encounter in data analysis and manipulation. Here's a brief introduction to some of them:

1. Excel Spreadsheets:

Excel files are widely used for storing tabular data. Pandas provides a function pd.read_excel() to read data from Excel files. Similarly, you can use pd.to_excel() to write Pandas DataFrames to Excel.

# Read Excel file
df_excel = pd.read_excel('data.xlsx', sheet_name='Sheet1')

# Write DataFrame to Excel
df.to_excel('output.xlsx', index=False)

2. MATLAB Files:

MATLAB files often have a .mat extension and can store matrices, arrays, and other MATLAB-specific data types. You can use the scipy.io module to read MATLAB files.

from scipy.io import loadmat

# Load MATLAB file
mat_data = loadmat('data.mat')
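
The object returned by loadmat() behaves like a dictionary mapping MATLAB variable names to NumPy arrays, so you can inspect what the file contains (a minimal sketch; 'x' is a hypothetical variable name):

# List the variables stored in the .mat file (plus metadata keys such as '__header__')
print(mat_data.keys())

# Pull out a specific variable by name
print(type(mat_data['x']))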

3. SAS Files:

SAS files are associated with the Statistical Analysis System. The pandas library provides a read_sas() function for reading SAS files.

# Read SAS file
df_sas = pd.read_sas('data.sas7bdat')

4. Stata Files:

Stata files have extensions like .dta and are common in the field of statistics. The pandas library supports Stata file reading with read_stata().

# Read Stata file
df_stata = pd.read_stata('data.dta')

5. HDF5 Files:

HDF5 (Hierarchical Data Format version 5) files are designed to store and organize large amounts of data. The h5py library is commonly used to work with HDF5 files.

import h5py

# Open HDF5 file
with h5py.File('data.h5', 'r') as file:
    # Access a dataset by name and read it into memory
    dataset = file['dataset_name']
    data = dataset[:]  # slicing an h5py Dataset returns a NumPy array
    print(data.shape)

Introduction to Relational Databases

Relational databases are a structured way to organize and store data, and Python provides several libraries to interact with them. One of the most commonly used libraries is SQLite, which is a lightweight, serverless, and self-contained relational database engine. Additionally, the SQLAlchemy library is popular for working with various relational databases in a more abstracted and flexible manner.

1. SQLite - A Simple Relational Database:

SQLite is a C library that provides a lightweight disk-based database. Python comes with built-in support for SQLite through the sqlite3 module. Here's a simple example of working with SQLite in Python:

import sqlite3

# Connect to the SQLite database (creates a new one if it doesn't exist)
conn = sqlite3.connect('example.db')

# Create a cursor object to execute SQL queries
cursor = conn.cursor()

# Execute a SQL query to create a table
cursor.execute('''
    CREATE TABLE IF NOT EXISTS users (
        id INTEGER PRIMARY KEY,
        username TEXT NOT NULL,
        email TEXT NOT NULL
    )
''')

# Insert data into the table
cursor.execute("INSERT INTO users (username, email) VALUES (?, ?)", ('user-1', '[email protected]'))

# Commit the changes and close the connection
conn.commit()
conn.close()

2. SQLAlchemy - An ORM for Relational Databases:

SQLAlchemy is a powerful and flexible Object-Relational Mapping (ORM) library that abstracts the interaction with relational databases. It provides a high-level, Pythonic interface for working with databases, allowing you to work with objects in a more intuitive way. Here's a basic example using SQLAlchemy's Core Table construct:

from sqlalchemy import create_engine, Column, Integer, String, MetaData, Table

# Create an SQLite database engine
engine = create_engine('sqlite:///example.db', echo=True)

# Define a Table and metadata
metadata = MetaData()
users = Table('users', metadata,
    Column('id', Integer, primary_key=True),
    Column('username', String, nullable=False),
    Column('email', String, nullable=False)
)

# Create the table in the database
metadata.create_all(engine)

# Insert data into the table (engine.begin() opens a transaction and commits it on exit)
with engine.begin() as conn:
    conn.execute(users.insert().values(username='User', email='[email protected]'))

3. Reading Data:

After creating tables and inserting data, you can retrieve data using SQL queries or ORM queries, depending on the approach you choose:

  • SQLite (with sqlite3):

    conn = sqlite3.connect('example.db')
    cursor = conn.cursor()
    
    # Execute a SELECT query
    cursor.execute("SELECT * FROM users")
    rows = cursor.fetchall()
    
    for row in rows:
        print(row)
    
    conn.close()

Another example, reflecting the database metadata and reading tables with pandas:

from sqlalchemy import create_engine, MetaData, text
import pandas as pd

# Create an SQLite database engine
engine = create_engine('sqlite:///DB/Chinook.sqlite')

# Connect to the database using a context manager
with engine.connect() as con:
    # Reflect metadata from the database
    metadata = MetaData()
    metadata.reflect(bind=engine)

    # Get the names of all tables in the database
    tables_names = metadata.tables.keys()
    print(tables_names)

    # Using pandas to fetch data from the 'Album' table
    query = text("SELECT * FROM Album")
    Album = pd.read_sql(query, con)

    # Using pandas to fetch data from the 'Artist' table
    query = text("SELECT * FROM Artist")
    Artist = pd.read_sql(query, con)

# Print the first few rows of the 'Album' table
print(Album.head())
print("\n")
# Print the first few rows of the 'Artist' table
print(Artist.head())
  • SQLAlchemy:

    from sqlalchemy import select
    
    # Use the select statement to retrieve data
    with engine.connect() as conn:
        result = conn.execute(select(users))
        rows = result.fetchall()
    
    for row in rows:
        print(row)

Course-2: Importing flat files from the web

Importing Flat Files from the Web in Python

In Python, you can import and locally save datasets from the web using various libraries. Two commonly used libraries for this purpose are urllib and requests. Additionally, you can load datasets into Pandas DataFrames for easy manipulation and analysis.

1. Using urllib for HTTP Requests:

The urllib module in Python provides functionality to make HTTP requests. Here's a simple example of downloading a file from the web:

from urllib.request import urlretrieve

# URL of the dataset
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv'

# Local filename to save the data and Download the file
urlretrieve(url, 'winequality-white.csv')

The following example makes a simple GET request with urllib to retrieve the HTML content of the Wikipedia homepage:

from urllib.request import urlopen, Request

# URL to make a GET request to
url = "https://www.wikipedia.org/"

# Create a Request object with the specified URL
request = Request(url)

# Open the URL using urlopen to get the response
response = urlopen(request)

# Read the HTML content from the response
html = response.read()

# Close the response object
response.close()
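
The same request can be written with a context manager, which closes the response automatically (a minimal sketch):

from urllib.request import urlopen

# urlopen() works as a context manager, so no explicit close() is needed
with urlopen("https://www.wikipedia.org/") as response:
    html = response.read()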

2. Using requests for HTTP Requests:

The requests library is a popular and more user-friendly alternative for making HTTP requests:

import requests

# URL of the dataset (the same wine-quality CSV as above)
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv'

# Make a GET request
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    # Save the content to a local file
    with open('local_data.csv', 'wb') as file:
        file.write(response.content)

3. Loading Datasets into Pandas DataFrames:

Once the data is saved locally, you can use Pandas to load it into a DataFrame:

import pandas as pd

# Read the CSV file into a DataFrame (this dataset uses ';' as its separator)
df = pd.read_csv('local_data.csv', sep=';')

# Now 'df' contains the data from the web in a DataFrame
print(df.head())

Another example downloads an Excel file and reads every sheet into a dictionary of DataFrames:

# Import packages
import pandas as pd
from urllib.request import urlretrieve

# Assign url of file: url
url = 'https://assets.datacamp.com/course/importing_data_into_r/latitude.xls'

# Download the file using urlretrieve
urlretrieve(url, 'latitude.xls')

# Read in all sheets of Excel file: xls
xls = pd.read_excel('latitude.xls', sheet_name=None)

# Print the sheet names to the shell
print("Sheet names:", list(xls.keys()))

# Print the head of the first sheet (using its name, NOT its index)
first_sheet_name = '1700'
print(f"\nHead of the sheet '{first_sheet_name}':")
print(xls[first_sheet_name].head())

4. Web Scraping with BeautifulSoup:

If the data is embedded in HTML and not available as a downloadable file, you can use the requests library along with BeautifulSoup for web scraping:

from bs4 import BeautifulSoup
import requests

# URL of the web page
url = 'https://www.crummy.com/software/BeautifulSoup/'

# Make a GET request
response = requests.get(url)

# Parse the HTML content
html_doc = response.text
soup = BeautifulSoup(html_doc, 'html.parser')

# Prettified Soup
print(soup.prettify())

# Extracting title
print("Title:", soup.title)

# Extracting text
print("Text:", soup.get_text())

# Extracting links
for link in soup.find_all('a'):
    print("Link:", link.get('href'))

5. Handling Authentication:

If the web server requires authentication, you can provide credentials with requests:

import requests
from requests.auth import HTTPBasicAuth

# URL of the dataset requiring authentication
url = 'https://example.com/data.csv'

# Provide authentication credentials
auth = HTTPBasicAuth('username', 'password')

# Make a GET request with authentication
response = requests.get(url, auth=auth)

Introduction to APIs and JSONs

APIs (Application Programming Interfaces):

APIs define a set of rules and protocols that allow different software applications to communicate with each other. In the context of web development, APIs often enable interaction with external services or data sources. Python provides libraries such as requests to work with APIs.

1. Making HTTP Requests:

The requests library simplifies the process of making HTTP requests to APIs. Here's a simple example:

import requests

# URL of the API endpoint
api_url = 'http://www.omdbapi.com/?t=hackers'

# Make a GET request to the API
response = requests.get(api_url)

# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Parse the JSON response
    json_data = response.json()
    for key, value in json_data.items():
        print(f"{key}: {value}")
else:
    print(f"Error: {response.status_code}")

Another example passes an OMDb API key as a query parameter:

import requests

# Assign the URL with query parameters
url = 'http://www.omdbapi.com/?apikey=72bc447a&'

# Send a GET request to the URL
response = requests.get(url)

# Print the text of the response
print(response.text)

# Import package
import requests

# Assign URL to variable: url
url = "https://en.wikipedia.org/w/api.php?action=query&prop=extracts&format=json&exintro=&titles=pizza"

# Package the request, send the request and catch the response: r
r = requests.get(url)

# Decode the JSON data into a dictionary: json_data

json_data = r.json()
# Print the Wikipedia page extract
pizza_extract = json_data['query']['pages']['24768']['extract']
print(pizza_extract)

2. Working with JSON Data:

APIs often return data in JSON (JavaScript Object Notation) format. Python has a built-in module called json for working with JSON data:

import json

# Sample JSON data
json_data = '{"name": "John", "age": 30, "city": "New York"}'

# Parse JSON string to Python dictionary
python_data = json.loads(json_data)

# Convert Python dictionary to JSON string
new_json_data = json.dumps(python_data, indent=2)

print(python_data)
print(new_json_data)

JSON stored in a file can be loaded with json.load():

import json

# Open the JSON file for reading
with open('snakes.json', 'r') as json_file:
    # Load the JSON data from the file
    json_data = json.load(json_file)

# Iterate through the key-value pairs in the JSON data
for key, value in json_data.items():
    # Print the key and its corresponding value
    print(key + ':', value)

3. Handling Authentication:

If an API requires authentication, you can include credentials in the request headers:

import requests

api_url = 'https://api.example.com/data'
headers = {'Authorization': 'Bearer YOUR_ACCESS_TOKEN'}

response = requests.get(api_url, headers=headers)

if response.status_code == 200:
    data = response.json()
    print(data)
else:
    print(f"Error: {response.status_code}")

4. Making POST Requests:

In addition to GET requests, you might need to make POST requests to send data to the API:

import requests

api_url = 'https://api.example.com/data'
data_to_send = {'key1': 'value1', 'key2': 'value2'}

response = requests.post(api_url, data=data_to_send)

if response.status_code == 200:
    data = response.json()
    print(data)
else:
    print(f"Error: {response.status_code}")

5. Using APIs with Python Libraries:

Certain Python libraries provide convenient ways to work with specific APIs. For example, the Tweepy library simplifies interactions with the Twitter API, and the praw library is designed for the Reddit API.

import tweepy
import json

# Replace the placeholders with your Twitter API credentials
access_token = "your_access_token"
access_token_secret = "your_access_token_secret"
consumer_key = "your_consumer_key"
consumer_secret = "your_consumer_secret"

# Create an OAuthHandler instance
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)

# Set the access token and access token secret
auth.set_access_token(access_token, access_token_secret)
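
From here, an API client can be built from the authenticated handler (a minimal sketch; whether the calls succeed depends on your Twitter/X API access level):

# Build the API client from the authenticated handler
api = tweepy.API(auth)

# Verify that the credentials work (raises an error if they do not)
print(api.verify_credentials().screen_name)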
