Link: AWS Machine Learning Foundations Course
- "using vectorized operations and more efficient data structures can optimize your code" - what are vectorized operations?
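Note to self: vectorized operations apply a computation to an entire array at once (e.g., with NumPy) instead of looping element by element in Python, which pushes the work into fast compiled code. A quick sketch:

```python
import numpy as np

prices = np.array([10.0, 20.0, 30.0])

# Loop version: one Python-level multiplication per element
discounted_loop = [p * 0.9 for p in prices]

# Vectorized version: a single NumPy operation over the whole array
discounted_vec = prices * 0.9
```

Both produce the same values, but the vectorized form avoids the Python-level loop entirely.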
Links
- How to version control your production machine learning models
- https://github.com/ShuaiW/data-science-question-answer
- cool repo, some good stuff
- TDD (Test Driven Development): write tests before the code
- unit test: a test that covers a small unit of code
- install pytest:
pip install -U pytest
- You need your test files to start with the word "test", as in test_nearest.py, and each test function must start with "test", as in def test_nearest_square_5():
- You then just type pytest in the terminal to run the tests
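A minimal sketch of the naming convention (the nearest_square helper is assumed for illustration):

```python
# test_nearest.py -- pytest collects files and functions whose names start with "test"

def nearest_square(num):
    """Largest perfect square <= num (helper assumed for this example)."""
    root = 0
    while (root + 1) ** 2 <= num:
        root += 1
    return root ** 2

def test_nearest_square_5():
    assert nearest_square(5) == 4

def test_nearest_square_9():
    assert nearest_square(9) == 9
```

Running pytest from this directory would discover and run both test functions.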
Is the code clean and modular?
- Can I understand the code easily?
- Does it use meaningful names and whitespace?
- Is there duplicated code?
- Can you provide another layer of abstraction?
- Is each function and module necessary?
- Is each function or module too long?
Is the code efficient?
- Are there loops or other steps we can vectorize?
- Can we use better data structures to optimize any steps?
- Can we shorten the number of calculations needed for any steps?
- Can we use generators or multiprocessing to optimize any steps?
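On the generators point, a sketch: a generator yields values lazily instead of materializing a whole list in memory.

```python
# List comprehension: stores all one million squares in memory at once
squares_list = [n * n for n in range(1_000_000)]

# Generator expression: produces one value at a time, on demand
squares_gen = (n * n for n in range(1_000_000))

# Consumes the generator without ever holding all the values in memory
total = sum(squares_gen)
```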
Is documentation effective?
- Are in-line comments concise and meaningful?
- Is there complex code that's missing documentation?
- Do functions use effective docstrings?
- Is the necessary project documentation provided?
Is the code well tested?
- Does the code have high test coverage?
- Do tests check for interesting cases?
- Are the tests readable?
- Can the tests be made more efficient?
Is the logging effective?
- Are log messages clear, concise, and professional?
- Do they include all relevant and useful information?
- Do they use the appropriate logging level?
Use a linter like pylint
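A sketch of those logging points using Python's standard logging module (safe_divide is a made-up example function):

```python
import logging

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s: %(message)s")
logger = logging.getLogger(__name__)

def safe_divide(a, b):
    # DEBUG: detail that's only useful when diagnosing a problem
    logger.debug("dividing %s by %s", a, b)
    if b == 0:
        # WARNING: something unexpected but recoverable
        logger.warning("division by zero; returning None")
        return None
    return a / b
```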
Links
- https://www.linkedin.com/pulse/data-science-test-driven-development-sam-savage/
- https://engineering.pivotal.io/post/test-driven-development-for-data-science/
- https://medium.com/uk-hydrographic-office/test-driven-development-is-essential-for-good-data-science-heres-why-db7975a03a44
- https://docs.python-guide.org/writing/tests/
- Code Reviews
- Objects have characteristics and can perform actions
- An object is a specific instance of something whereas a class is the generic version of the object, or blueprint of it
- Here are some terms worth knowing:
- class - a blueprint consisting of methods and attributes
- object - an instance of a class. It can help to think of objects as something in the real world like a yellow pencil, a small dog, a blue shirt, etc. However, as you'll see later in the lesson, objects can be more abstract.
- attribute - a descriptor or characteristic. Examples would be color, length, size, etc. These attributes can take on specific values like blue, 3 inches, large, etc.
- method - an action that a class or object could take
- OOP - a commonly used abbreviation for object-oriented programming
- encapsulation - one of the fundamental ideas behind object-oriented programming is called encapsulation: you can combine functions and data all into a single entity. In object-oriented programming, this single entity is called a class. Encapsulation allows you to hide implementation details much like how the scikit-learn package hides the implementation of machine learning algorithms.
- method vs function:
- a method is a function inside a class while a function is outside of a class
- when calling class methods, notice how you don't have to pass self in as an argument; it is passed implicitly (you still list self as the first parameter when defining the method)
- If you saved your Shirt class in a file called shirt.py, you would import it by doing the following:
from shirt import Shirt
- this assumes that your class is named "Shirt" (with a capital "S")
- There are a number of drawbacks of accessing object properties directly vs. using getter and setter methods. Python is looser than other OO languages
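For example, the common Python middle ground is a leading underscore plus the @property decorator, which keeps attribute-style access but lets a setter validate values (a sketch using the Shirt example):

```python
class Shirt:
    def __init__(self, price):
        self._price = price  # underscore signals "internal"; Python doesn't enforce privacy

    @property
    def price(self):  # getter: read with shirt.price
        return self._price

    @price.setter
    def price(self, new_price):  # setter: can validate before assigning
        if new_price < 0:
            raise ValueError("price cannot be negative")
        self._price = new_price
```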
- Gaussian (normal distribution) calculator: http://onlinestatbook.com/2/calculators/normal_dist.html
- Binomial Distribution calculator: http://onlinestatbook.com/2/calculators/binomial_dist.html
- The exercise on magic methods was interesting. Notice in the magic_methods.py file there is an __add__ method. I'm overriding Python's normal add method so that when I do the following I don't get an error:
gaussian_one = Gaussian(25, 3)
gaussian_two = Gaussian(30, 4)
gaussian_sum = gaussian_one + gaussian_two # __add__ magic method
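A sketch of what that __add__ might look like; for independent Gaussians, the means add and the variances add:

```python
import math

class Gaussian:
    def __init__(self, mean, stdev):
        self.mean = mean
        self.stdev = stdev

    def __add__(self, other):
        # Sum of independent Gaussians: means add, variances add
        return Gaussian(self.mean + other.mean,
                        math.sqrt(self.stdev ** 2 + other.stdev ** 2))

gaussian_sum = Gaussian(25, 3) + Gaussian(30, 4)
# gaussian_sum.mean == 55, gaussian_sum.stdev == 5.0 (sqrt of 9 + 16)
```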
Inheritance
- Inheritance is pretty self-explanatory in Python. Here is an example of the Shirt class that inherits from Clothing:
class Clothing:
    def __init__(self, color, size, style, price):
        self.color = color
        self.size = size
        self.style = style
        self.price = price

    def change_price(self, price):
        self.price = price

    def calculate_discount(self, discount):
        return self.price * (1 - discount)

    def calculate_shipping(self, weight, rate):
        return weight * rate

class Shirt(Clothing):
    def __init__(self, color, size, style, price, long_or_short):
        Clothing.__init__(self, color, size, style, price)
        self.long_or_short = long_or_short

    def double_price(self):
        self.price = 2 * self.price
- Clothing is pretty normal, nothing exciting there
- Shirt first has (Clothing) on the class definition line
- Notice the __init__ method; it's a normal __init__ method except you first call the Clothing class's __init__ and then set any properties for your Shirt class
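Note to self: calling Clothing.__init__(self, ...) works, but super().__init__(...) is the more common Python 3 idiom; a sketch with the same classes:

```python
class Clothing:
    def __init__(self, color, size, style, price):
        self.color = color
        self.size = size
        self.style = style
        self.price = price

class Shirt(Clothing):
    def __init__(self, color, size, style, price, long_or_short):
        super().__init__(color, size, style, price)  # no explicit self or class name
        self.long_or_short = long_or_short
```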
Advanced OOP Topics
Here are some Python-focused OOP articles and concepts:
- class methods, instance methods, and static methods: these are different types of methods that can be accessed at the class or object level
- class attributes vs instance attributes: you can also define attributes at the class level or at the instance level
- multiple inheritance, mixins: A class can inherit from multiple parent classes
- Python decorators: a shorthand syntax for wrapping one function inside another to extend its behavior
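A quick sketch, since that description is terse: a decorator takes a function and returns a wrapped version of it (log_call and apply_discount are made-up examples):

```python
import functools

def log_call(func):
    """Decorator: wraps func so each call is announced before it runs."""
    @functools.wraps(func)  # preserves func's name and docstring
    def wrapper(*args, **kwargs):
        print(f"calling {func.__name__} with args {args}")
        return func(*args, **kwargs)
    return wrapper

@log_call  # equivalent to: apply_discount = log_call(apply_discount)
def apply_discount(price):
    return price * 0.9
```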
Making a package
- I won't go through everything; here are the basics. You can see what's really happening in the folder in this repo: 3a_python_package
- my_python_package (package root)
  - setup.py (sets up package)
  - distributions (code for my package)
    - __init__.py: the init code for my package
    - Generaldistribution.py: the parent class for my Gaussian distribution class
    - Gaussiandistribution.py: Gaussian distribution class
- To use it, I could go to that folder and do
pip install .
- this will install it. And then do
python
in the Terminal to bring up the interpreter:
from distributions import Gaussian
gaussian_one = Gaussian(25, 2)
gaussian_one.mean
gaussian_one + gaussian_one
- good overview of how to contribute to a GitHub project: https://github.com/MarcDiethelm/contributing/blob/master/README.md or https://akrabat.com/the-beginners-guide-to-contributing-to-a-github-project/
Uploading Package to PyPi
cd binomial_package_files
python setup.py sdist
pip install twine
# commands to upload to the pypi test repository
twine upload --repository-url https://test.pypi.org/legacy/ dist/*
pip install --index-url https://test.pypi.org/simple/ dsnd-probability
# command to upload to the pypi repository
twine upload dist/*
pip install dsnd-probability
- tutorial on creating packages: https://packaging.python.org/tutorials/packaging-projects/
- Types of ML Techniques
- Supervised Learning
- every training example has a corresponding label
- Unsupervised Learning
- No labels for training data
- Most Generative AI is unsupervised learning
- Reinforcement Learning
- learns through consequences of action in specific environment
- Generative AI is one of the most promising new technologies
- Generative AI pits two different neural networks against each other to produce new and original digital works based on sample inputs
their notes: Machine Learning Techniques
Supervised Learning: Models are presented with input data and the desired results. The model will then attempt to learn rules that map the input data to the desired results.
Unsupervised Learning: Models are presented with datasets that have no labels or predefined patterns, and the model will attempt to infer the underlying structures from the dataset. Generative AI is a type of unsupervised learning.
Reinforcement learning: The model or agent will interact with a dynamic world to achieve a certain goal. The dynamic world will reward or punish the agent based on its actions. Over time, the agent will learn to navigate the dynamic world and accomplish its goal(s) based on the rewards and punishments it has received.
- AWS DeepComposer is how Amazon is teaching developers how to use GAN (Generative Adversarial Networks) to generate music
- GANs pit 2 networks, a generator and a discriminator, against each other to generate new content.
- generator: creates new output
- discriminator: evaluates the quality of output AND provides feedback
- Each iteration of the training cycle is called an epoch
- The goal of iterating and completing epochs is to improve the output or prediction of the model.
- output that deviates from the ground truth is referred to as an error.
- The measure of an error, given a set of weights, is called a loss function.
- Weights represent how important an associated feature is to determining the accuracy of a prediction
- loss functions are used to update the weights after every iteration.
- Ideally, as the weights update, the model improves, making fewer and fewer errors.
- Convergence happens once the loss functions stabilize.
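The epoch/loss/weight vocabulary above can be illustrated with plain gradient descent on a single weight (a toy sketch, not the actual GAN training procedure):

```python
# Toy data generated by y = 2 * x; we try to learn the weight w.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

w = 0.0               # initial weight
learning_rate = 0.05

for epoch in range(100):          # each pass over the data is an epoch
    # Gradient of the mean squared error loss with respect to w
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= learning_rate * grad     # the loss is used to update the weight

# As the loss stabilizes, w has converged close to the true value 2.0
```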
Challenges with GANs
- Clean datasets are hard to obtain
- Not all melodies sound good in all genres
- Convergence in GANs is tricky – it can be fleeting rather than being a stable state
- if you keep training it, it could be training on junk feedback
- Complexity in defining meaningful quantitative metrics to measure the quality of music created
- the Similarity Index trends toward zero but does not necessarily reach zero
- Generative AI techniques include:
- Generative Adversarial Networks (GANs)
- Variational Autoencoders
- Transformers
GANs
- The generator and the discriminator are trained in alternating cycles such that the generator learns to produce more and more realistic data while the discriminator iteratively gets better at learning to differentiate real data from the newly created data.
- The generator network used in AWS DeepComposer is adapted from the U-Net architecture, a popular convolutional neural network
- Order of steps in the U-Net architecture:
- Input
- Encoder
- Latent space
- Decoder
- Output
- In the case of AWS DeepComposer:
The network consists of an "encoder" that maps the single-track music data (represented as piano roll images) to a relatively lower-dimensional "latent space" and a "decoder" that maps the latent space back to multi-track music data.
- The discriminator loss has been found to correlate well with sample quality.
How the Model Works
The model consists of two networks, a generator and a critic (the discriminator). These two networks work in a tight loop:
- The generator takes in a batch of single-track piano rolls (melody) as the input and generates a batch of multi-track piano rolls as the output by adding accompaniments to each of the input music tracks.
- The discriminator evaluates the generated music tracks and predicts how far they deviate from the real data in the training dataset.
- The feedback from the discriminator is used by the generator to help it produce more realistic music the next time.
- As the generator gets better at creating better music and fooling the discriminator, the discriminator needs to be retrained by using music tracks just generated by the generator as fake inputs and an equivalent number of songs from the original dataset as the real input.
- We alternate between training these two networks until the model converges and produces realistic music.
- The discriminator is a binary classifier which means that it classifies inputs into two groups, e.g. “real” or “fake” data.