
Reading and Discussion List for Stanford CS348K

Lecture 1: Throughput Computing Review

Post-Lecture Required Readings: (2)

  • The Compute Architecture of Intel Processor Graphics Gen9. Intel Corporation

    • This is not an academic paper, but a whitepaper from Intel describing the architecture of a recent Intel GPU. I'd like you to read the whitepaper, focusing on the description of the processor in Sections 5.3-5.5. Then, given your knowledge of the concepts discussed in lecture (such as superscalar execution, multi-core, multi-threading, etc.), I'd like you to describe the organization of the processor (using terms from the lecture, not Intel terms). For example, what is the basic processor building block? How many hardware threads does it support? What width of SIMD instructions are executed by those threads? Does it have superscalar execution capabilities? How many times is this block replicated for additional parallelism?
    • Consider your favorite data-parallel programming language, such as GLSL/HLSL shading languages, CUDA, OpenCL, ISPC, or just an OpenMP #pragma parallel for. Can you think through how an embarrassingly parallel "for" loop can be mapped to this architecture? (You don't need to write this down, but you could if you wish. A minimal sketch of one possible mapping appears after this reading list.)
    • For those that want to go further, I also encourage you to read NVIDIA's V100 (Volta) Architecture whitepaper, linked in the "further reading" below. Can you put the organization of this GPU in correspondence with the organization of the Intel GPU? You could make a table contrasting the features of a modern AVX-capable Intel CPU, Intel Integrated Graphics (Gen9), NVIDIA GPUs, etc.
  • What Makes a Graphics Systems Paper Beautiful. Fatahalian (2019)

    • A major theme of this course is "thinking like a systems architect". This short blog post discusses how systems architects think about the intellectual merit and evaluation of systems. Read the blog post, and click through to some of the paper links. These are the types of issues, and the types of systems, we will be discussing in this class.
    • If you want to read ahead, give yourself some practice with identifying "goals and constraints" by looking at sections 1 and 2 of Google's paper Burst Photography for High Dynamic Range and Low-Light Imaging on Mobile Cameras. What were the goals and constraints underlying the design of the camera application in Google Pixel smartphones?
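
To make the "parallel for" question above concrete, here is a minimal sketch of how an embarrassingly parallel loop decomposes onto the units discussed in lecture. It is plain Python/numpy standing in for pseudocode, and the machine parameters are illustrative assumptions, not Gen9's actual numbers.

```python
import numpy as np

# The embarrassingly parallel loop: out[i] = f(x[i]) for all i.
def f(v):
    return 2.0 * v + 1.0

x = np.arange(1 << 20, dtype=np.float32)
out = np.empty_like(x)

# Illustrative machine parameters (assumptions, not Gen9's real numbers):
NUM_CORES = 8         # replicated processor building blocks
THREADS_PER_CORE = 8  # hardware threads per block, to hide latency
SIMD_WIDTH = 8        # 32-bit lanes processed per SIMD instruction

num_threads = NUM_CORES * THREADS_PER_CORE
chunk = len(x) // num_threads
for t in range(num_threads):  # one contiguous slice per hardware thread
    lo = t * chunk
    for i in range(lo, lo + chunk, SIMD_WIDTH):
        # One SIMD instruction: all 8 lanes evaluate f simultaneously.
        out[i:i + SIMD_WIDTH] = f(x[i:i + SIMD_WIDTH])

assert np.allclose(out, 2.0 * x + 1.0)
```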

Other Recommended Readings:

  • Volta: Programmability and Performance. Hot Chips 29 (2017)
    • This Hot Chips presentation documents features of the NVIDIA Volta GPU. Take a good look at how the chip is broken down into 80 streaming multiprocessors (SMs), how each SM can issue up to 4 warp instructions per clock, and how each SM supports up to 64 concurrent warps. You may also want to look at the NVIDIA Volta Whitepaper. (A quick occupancy calculation using these numbers appears after this reading list.)
  • The Story of ISPC. Pharr (2018)
    • Matt Pharr's multi-part blog post is a riveting description of the history of ISPC, a simple, and quite useful, language and compiler for generating SIMD code for modern CPUs from a SPMD programming model. ISPC was motivated by the frustration that the SPMD programming benefits enjoyed by CUDA and GLSL/HLSL programmers on GPUs could just as easily be realized on CPUs, provided applications were written in a simpler, constrained programming system that did not have all the analysis challenges of a language like C/C++.
  • Scalability! But at What COST? McSherry, Isard, and Murray. HotOS 2015
    • The arguments in this paper are very consistent with the way we think about performance in the visual computing domain. In other words, efficiency and raw performance are different from "scalability".
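
Since the Volta bullet above quotes the key numbers, the occupancy arithmetic is worth doing once (32 threads per warp is NVIDIA's standard warp size):

```python
SMS = 80             # streaming multiprocessors on the chip
WARPS_PER_SM = 64    # maximum resident (concurrent) warps per SM
THREADS_PER_WARP = 32

# Total hardware thread contexts the chip can keep resident at once:
print(SMS * WARPS_PER_SM * THREADS_PER_WARP)  # 163840
```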

Lecture 2: Digital Camera Processing Pipeline Basics

Post-Lecture Required Readings:

  • Burst Photography for High Dynamic Range and Low-light Imaging on Mobile Cameras. Hasinoff et al. SIGGRAPH Asia 2016
    • This is a very technical paper. But don't worry: your job is not to understand all the technical details of the algorithms; it is to approach the paper with a systems mindset, and think about the end-to-end considerations that went into the particular choice of algorithms. In general, I want you to pay the most attention to Section 1, Section 4.0 (you can ignore the detailed subpixel alignment in 4.1), Section 5 (I will talk about why merging is done in the Fourier domain in class), and Section 6. Specifically, as you read this paper, I'd like you to think about the following issues:
    • Any good system typically has a philosophy underlying its design. This philosophy serves as a framework with which the system architect determines whether design decisions are good or bad, consistent with principles or not, etc. Page 2 of the paper clearly lists some of the principles that underlie the philosophy taken by the creators of the camera processing pipeline at Google. For each of the four principles, give an assessment of why the principle is important.
    • The main technical idea of this paper is to combine a sequence of similarly underexposed photos, rather than attempt to combine a sequence of photos with different exposures (the latter is called “bracketing”). What are the arguments in favor of the chosen approach? Appeal to the main system design principles. By the way, you can learn more about bracketing at these links.
    • Designing a good system is about meeting design goals, subject to certain constraints. (If there were no constraints, it would be easy to use unlimited resources to meet the goals.) What are the major constraints of the system? For example are there performance constraints? Usability constraints? Etc.
    • What is the motivation for the weighted merging process described in Section 5? Why did the authors not use the far simpler approach of just adding up all the aligned images? (Again, can you appeal to the stated design principles? A toy contrast between naive and weighted merging appears after this list.)
    • Finally, take a look at the “finishing” steps in Section 6. Many of those steps should sound familiar to you after today’s lecture.
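
This is not the paper's actual Fourier-domain algorithm, but a toy numpy sketch of the contrast the merging question above is driving at: plain averaging lets any misaligned or moving content ghost into the result, while a robust, weighted merge down-weights content that disagrees with the reference, so failures degrade gracefully toward the (noisier) reference. The per-pixel weighting below is my simplification; the paper merges per tile, in the Fourier domain.

```python
import numpy as np

def naive_merge(frames):
    # Plain averaging: any misaligned or moving content ghosts directly
    # into the result.
    return np.mean(frames, axis=0)

def robust_merge(frames, ref_idx=0, sigma=0.05):
    # Toy robust merge: each pixel of each alternate frame contributes
    # in proportion to how well it matches the reference frame.
    ref = frames[ref_idx]
    accum = ref.copy()
    total_w = np.ones_like(ref)
    for i, f in enumerate(frames):
        if i == ref_idx:
            continue
        w = np.exp(-((f - ref) ** 2) / (2.0 * sigma ** 2))  # mismatch -> ~0
        accum += w * f
        total_w += w
    return accum / total_w

burst = np.random.rand(8, 64, 64).astype(np.float32) * 0.1  # underexposed burst
denoised = robust_merge(burst)
```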

Other Recommended Readings:

Lecture 3: Digital Camera Processing Pipeline Basics

Post-Lecture Required Readings:

  • The Frankencamera: An Experimental Platform for Computational Photography. A. Adams et al. SIGGRAPH 2010
    • Frankencamera was a paper written right about the time mobile phone cameras were becoming “acceptable” in quality, phones were beginning to contain a non-trivial amount of compute power, and computational photography papers were an increasingly hot topic in the SIGGRAPH community. At the time, many compelling image processing and editing techniques were being published, and many of them revolved around generating high-quality photographs from a sequence of multiple shots or exposures. However, cameras at the time provided a very poor API to the camera hardware and its components. In short, many of the pieces were there for a programmable camera platform to be built, but someone had to attempt to architect a coherent system to make them accessible. Frankencamera was an attempt to do just that. It involved two things:
      • The design of an API for programming cameras (a mental model of an abstract programmable camera architecture).
      • And two implementations of that architecture: an open camera reference design, and an implementation on a Nokia smartphone.
    • When you read the paper, we’re going to focus on the abstract architecture presented by the Frankencamera. (A hypothetical API sketch appears after this list.) Specifically, I’d like you to think about the following:
      1. I’d like you to describe the major pieces of the Frankencamera abstract machine (the system’s nouns): e.g., devices, sensors, processors, etc.
      2. Then describe the major operations the machine can perform (the system’s verbs). For example, would you say a “shot” is a single command to the machine, or is a shot a set of commands? Would “timeline” be a good word to describe a “shot”?
      3. What output does executing a shot generate? How is a frame different from a shot? Why is this distinction made by the system?
      4. Would you say that F-cam is a “programmable” camera architecture or a “configurable” architecture? What kinds of “programs” does the abstract machine run? (Note: see question 2.)
      5. How would you characterize the particular type of computational photography algorithms that F-cam seeks to support/facilitate/enable (provides value for)?
    • Students may be interested that vestiges of ideas from the Frankencamera can now be seen in the Android Camera2 API: https://developer.android.com/reference/android/hardware/camera2/package-summary
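
To help frame questions 1-4, here is a hypothetical Python rendering of the F-cam abstractions (all names below are mine, for illustration only; the real API is C++ and differs in detail). The key thing to notice: a shot is declarative data describing one frame's capture state, a burst is an ordered list of shots, and executing a burst yields frames tagged with the state the hardware actually used.

```python
from dataclasses import dataclass, field

@dataclass
class Shot:
    # A "shot" is data, not code: the complete capture state requested
    # for one frame (exposure time in seconds, analog gain, etc.).
    exposure: float
    gain: float
    white_balance: float = 5500.0
    actions: list = field(default_factory=list)  # e.g., "fire flash at t=10ms"

@dataclass
class Frame:
    # A frame is the *output* of executing a shot: image data plus the
    # settings the hardware actually used (which may differ slightly
    # from what was requested).
    pixels: object
    shot: Shot

class Camera:
    def capture(self, burst):
        # The runtime streams the burst through the imaging pipeline,
        # one shot per frame, and returns the corresponding frames.
        return [Frame(pixels=None, shot=s) for s in burst]

# An HDR burst: three differently exposed shots submitted as one unit.
frames = Camera().capture([Shot(0.005, 1.0), Shot(0.02, 1.0), Shot(0.08, 1.0)])
```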

Other Recommended Readings:

Lecture 4: Efficiently Scheduling Image Processing Algorithms

Post-Lecture Required Readings: (2)

  • Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines. Ragan-Kelley, Adams, et al. PLDI 2013

    • Note: Alternatively you may read the selected chapters in the Ragan-Kelley thesis linked below in recommended readings. The thesis chapters involve a little more reading than the paper, but it is a more accessible explanation of the topic, so I recommend it for students.
    • In reading this paper, I want you to specifically focus on describing the philosophy of Halide. Specifically, if we ignore the "autotuner" described in Section 5 of the paper, what is the role of the programmer, and what is the role of the Halide system/compiler? (A toy algorithm/schedule example appears after these readings.)
      • Hint 1: Which component is responsible for major optimization decisions?
      • Hint 2: Can a change to a schedule change the output of a Halide program?
    • Who do you think is the type of programmer targeted by Halide? Novices? Experts? Etc.?
    • Advanced question: In my opinion, there is one major place where the core design philosophy of Halide is violated. It is described in Section 4.3 of the paper, but is more clearly described in Section 8.3 of the Ph.D. thesis (see sliding window optimizations and storage folding). Why do you think I am claiming this compiler optimization is a significant departure from the core principles of Halide? (There are also valid arguments against my opinion.)
      • Hint: what aspects of the program’s execution is not explicitly described in the schedule in these situations?
  • Learning to Optimize Halide with Tree Search and Random Programs. Adams et al. SIGGRAPH 2019

    • This paper documents the design of the autoscheduling algorithm that is now implemented in the Halide compiler. This is quite a technical paper, so I recommend that you adopt the "coarse to fine" reading structure that we discussed in class. Your goal is to get the big points of the paper, not all the details.
    • The back-tracking tree search used in this paper is certainly not a new idea (you've probably implemented algorithms like this in an introductory AI class), but what was interesting was the way the authors formulated the scheduling problem as a sequence of choices that could be optimized using tree search. Summarize how scheduling is modeled as a sequence of choices.
      • Note: one detail you might be interested to take a closer look at is the "coarse-to-fine refinement" part of Section 3.2. This is a slight modification to a standard backtracking tree search.
    • An optimizer's goal is to minimize a cost. In the case of this paper, the cost is the runtime of the scheduled program. Why is a machine learned model used to predict the scheduled program's runtime? Why not just compile the program and run it on a machine?
    • The other interesting part of this paper is the engineering of the learned cost model. This was surprisingly difficult. Observe that the authors do not present an approach based on end-to-end learning where the input is a Halide program DAG and the output is an estimated cost; instead, they use compiler analysis to compute a collection of program features, and then what is learned is how to weight these features in estimating cost (see Section 4.2). For those of you with a bit of deep learning background, I'm interested in your thoughts here. Do you like the hand-engineered features approach?
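
To ground the algorithm/schedule questions for the first paper, here is a minimal sketch using Halide's Python bindings (a sketch assuming a recent `pip install halide`; the paper itself uses C++ syntax). The first block is the algorithm; the schedule lines below it can be changed freely, which alters performance but never the computed values.

```python
import halide as hl

x, y = hl.Var("x"), hl.Var("y")

# Algorithm: *what* is computed (purely functional; no execution order).
producer = hl.Func("producer")
producer[x, y] = hl.cast(hl.Float(32), x + y) * 0.5
consumer = hl.Func("consumer")
consumer[x, y] = producer[x, y] + producer[x, y + 1]

# Schedule: *how* it is computed. These choices trade off parallelism,
# locality, and recomputation, but cannot change the output.
producer.compute_at(consumer, y)      # (re)compute producer per consumer row
consumer.parallel(y).vectorize(x, 8)  # parallel rows, 8-wide SIMD columns

out = consumer.realize([256, 256])
```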

Other Recommended Readings:

Lecture 5: Efficient DNN Inference (Software Techniques)

Post-Lecture Required Reading:

  • In-Datacenter Performance Analysis of a Tensor Processing Unit. Jouppi et al. ISCA 2017
    • Like many computer architecture papers, the TPU paper includes a lot of facts about the details of the system. I encourage you to understand these details, but try to look past all the complexity and look for the main lessons learned (motivation, key constraints, key principles in the design). Here are the questions I'd like to see you address:
    • What was the motivation for Google to seriously consider the use of a custom processor for accelerating DNN computations in their datacenters, as opposed to using CPUs or GPUs? (Section 2)
    • I'd like you to resummarize how the matrix_multiply operation works. More precisely, can you flesh out the details of how the TPU carries out the work described in this sentence at the bottom of page 3: "A matrix operation takes a variable-sized B×256 input, multiplies it by a 256×256 constant weight input, and produces a B×256 output, taking B pipelined cycles to complete".
    • I'd like to talk about the "roofline" charts in Section 4. These graphs plot the max performance of the chip (Y axis) as a function of a program's arithmetic intensity (X axis: the ratio of math operations performed to data accessed). How are these graphs used to assess the performance of the TPU and to characterize the workload run on the TPU? (The roofline bound is sketched in code after this list.)
    • Section 8 (Discussion) of this paper is an outstanding example of good architectural thinking. Make sure you understand the points in this section as we'll discuss a number of them in class. Particularly for us in this class, what is the point of the bullet "Pitfall: Architects have neglected important NN tasks."?
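
For the roofline question, it helps to remember that the whole model is a single `min()`. Here is a sketch using the headline numbers the paper gives for the TPU (65,536 MACs at 700 MHz, and roughly 34 GB/s of memory bandwidth for weights); treat these numbers as approximate.

```python
PEAK_MACS_PER_S = 65536 * 700e6  # ~46e12 multiply-accumulates / sec
WEIGHT_BW_BYTES_PER_S = 34e9     # weight-memory bandwidth

def attainable(intensity_macs_per_byte):
    # Roofline: you are limited either by compute (the flat roof) or by
    # memory bandwidth (the sloped part), whichever bound is lower.
    return min(PEAK_MACS_PER_S, WEIGHT_BW_BYTES_PER_S * intensity_macs_per_byte)

ridge = PEAK_MACS_PER_S / WEIGHT_BW_BYTES_PER_S
print(f"ridge point ~ {ridge:.0f} MACs per weight byte")  # ~1350
```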

Strongly, Strongly Recommended Readings:

Other Recommended Readings:

Lecture 6: DNN Hardware Accelerators

Post-Lecture Required Reading:

  • There was no post-lecture required reading. (We read the TPU paper last time, so take a break!)

Other Recommended Readings:

Lecture 7: Parallel DNN Training

Post-Lecture Required Reading:

  • There are two required readings post Lecture 7, but they are pre-reading for Lecture 8, so please see the readings listed under Lecture 8.

Other Recommended Readings:

Lecture 8: Raising the Level of Abstraction for Model Creation

Pre-Lecture Required Reading:

Often when we hear about machine learning abstractions, we think about ML frameworks like PyTorch, TensorFlow, or MXNet. Instead of having to write key ML model layers yourself (like I force you to do in Assignment 2), these frameworks present the abstraction of an ML operator, and allow model creation by composing operators into DAGs. However, the abstraction of designing models by wiring up data flow graphs of operators is still quite low. One might characterize these abstractions as being targeted at an ML engineer---someone who has taken a lot of ML classes, has experience implementing model architectures in TensorFlow, experience selecting the right model for the job, or the know-how to adjust hyperparameters to make training successful.

The two papers for tomorrow’s discussion begin to raise the level of abstraction even higher. These are systems that emerged out of two major companies (Overton out of Apple, and Ludwig out of Uber), and they share the underlying philosophy that some of the operational details of getting modern ML to work can be abstracted away from users that simply want to use these technologies to quickly train, validate, and maintain accurate models. In short, these two systems can be thought of as different takes on Karpathy’s “Software 2.0” argument, which you can read in this Medium blog post. I’m curious about your thoughts on this post as well!

When reading these papers, please consider the following:

  • A good system provides valuable services to the user. So in these papers, who is the “user” (what is their goal, what is their skillset?) and what are the painful, hard, or tedious things that the systems are designed to do for the user?

  • Another way we can think about these papers is that they are taking a position that existing systems are helping users with the wrong problem. What types of problems are these systems really trying to help with? (Hint: do you think they are more geared toward the design of new ML model architectures, or toward getting the right training data into the system?)

  • The following two (very similar) statements appear in the papers. First, what is the value of this separation? Or at least what is the future promise of this separation? (what system services does it enable?)

    • Overton: "Informally, the schema defines what the model computes but not how the model computes it."
    • Ludwig: "The higher level of abstraction provided by the type-based ECD architecture allows for a separation between what a model is expected to learn to do and how it actually does it."
  • Following up on the previous question: Do you buy the claim that Ludwig truly separates what a model is expected to learn and how it learns it? It seems like the user specifies a good bit about the dataflow of the solution. What are your thoughts? (It seems like Overton's abstractions are a lot closer to really "zero code" model design.)

  • Let’s specifically contrast the abstractions of Ludwig with those of a lower-level ML system like TensorFlow. TensorFlow/MXNet/PyTorch largely abstract ML model definition as a DAG of N-D tensor operations. How is Ludwig different? What are the operators, and what are the data types exchanged by operators? What is the value of having richer types than just forcing all input/output data to be an N-D tensor? (A sketch of a Ludwig-style model declaration appears below.)
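
For a concrete sense of the abstraction gap in the last question, here is roughly what declaring a model looks like in Ludwig's Python API (sketched from the paper's description; exact config keys and type names may differ across Ludwig versions, and the dataset file is hypothetical). Note what is absent: no layers, no DAG of tensor ops, just typed inputs and outputs.

```python
from ludwig.api import LudwigModel

# The user declares *what* to learn: typed input and output features.
# Ludwig picks encoders/decoders and the training loop from the types.
config = {
    "input_features": [
        {"name": "review_text", "type": "text"},
        {"name": "star_rating", "type": "numerical"},
    ],
    "output_features": [
        {"name": "is_helpful", "type": "binary"},
    ],
}

model = LudwigModel(config)
# train_stats = model.train(dataset="reviews.csv")  # hypothetical CSV file
```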

Other Recommended Readings:

Lecture 9: System Support for Curating Training Data

Pre-Lecture Required Reading:

  • Snorkel: Rapid Training Data Creation with Weak Supervision. Ratner et al. VLDB 2017.
    • First let's get our terminology straight. What is meant by "weak supervision", and how does it differ from the traditional supervised learning scenario, where a training procedure is provided a training set consisting of a set of data and a corresponding set of ground-truth labels?
    • Like in all systems, I'd like everyone to pay particular attention to the design principles described in Section 1. Note that you may also wish to simultaneously read the Snorkel DryBell paper in the suggested readings below, as it has an amended list of principles formulated after deploying Snorkel at Google. If you had to boil the entire philosophy of Snorkel down to one thing, what would you say it is? Hint: look at principle 1 in the Snorkel paper.
    • The main abstraction in Snorkel is the labeling function. Please describe what the output interface of a labeling function is. Then, and most importantly, what is the value of this abstraction? (Hint: you probably want to refer to the key principle of Snorkel. A toy labeling function appears after this list.)
    • What is the role of the final model training part of Snorkel (training an off-the-shelf architecture on the supervision produced by Snorkel)? Why not just use the probabilistic labels as the model itself?
    • One interesting aspect of Snorkel is the notion of learning from non-servable assets. This is definitely not an issue that would be high on the list of academic concerns, but is quite important. (This is perhaps more clearly articulated in the Snorkel DryBell paper, so take a look there).
    • In general, I'd like you to reflect on Snorkel and (if time) some of the recommended papers below. (see the Rekall blog post, or the Model assertions paper for different takes.) I'm curious about your comments.
    • From the ML perspective, the key technical insight of Snorkel (and of most follow on Snorkel papers, see here, here, and here for some examples) is the mathematical modeling of correlations between labeling functions in order to more accurately estimate probabilistic labels. We will not talk in detail about these generative algorithms in class, but many of you will enjoy learning about them.
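
To make the labeling-function abstraction concrete, here is a toy example in the style of the open-source Snorkel package (the VLDB paper's original interface differs in detail). Note the interface: each function votes a class label or abstains, and nothing about accuracy or correlations appears in the function itself; estimating those is the label model's job.

```python
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

ABSTAIN, HAM, SPAM = -1, 0, 1

@labeling_function()
def lf_contains_link(x):
    # A noisy heuristic: vote SPAM if a URL appears, otherwise abstain.
    return SPAM if "http" in x.text else ABSTAIN

@labeling_function()
def lf_short_message(x):
    return HAM if len(x.text) < 20 else ABSTAIN

df = pd.DataFrame({"text": ["check out http://spam.example",
                            "thanks!",
                            "see http://x.example now"]})
L = PandasLFApplier([lf_contains_link, lf_short_message]).apply(df)

# The label model estimates LF accuracies/correlations and emits
# probabilistic labels, which then train an off-the-shelf end model.
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L)
probs = label_model.predict_proba(L)
```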

Other Recommended Readings:

Lecture 10: Specialization for Efficient Inference on Video Streams

Pre-Lecture Required Reading:

  • Online Model Distillation for Efficient Video Inference, Mullapudi et al. ICCV 2019
    • In your own words, give an explanation for why the smaller (and cheaper) DNN model is able to generate output of similar quality to that of the much larger, more expensive Mask R-CNN segmentation model.
    • Rather than think about this paper as a paper about DNN efficiency optimization, I'd like you to think about it in the context of the previous class: it's a paper about acquiring supervision. (The core loop is sketched after this list.) There are two parts to this: what data (specifically, what video frames) is the right data to train on, and how does the system obtain supervision for those frames? Let's break this into two parts:
      • One way to curate a dataset for a specific video stream is to sample a large amount of data from the stream. And I'm sure if you worked hard enough, you could curate a good dataset. What are potential pitfalls with this approach? What is the key idea that allows this paper to get around these problems?
      • One of the challenges of online model distillation is that the system must operate at time scales that prevent human labeling from being the source of supervision. Using a more expensive (and more trustworthy) model is what was done here, but there was one more trick (see the second paragraph of 3.2).
    • Most people read this paper and ask the same question about a potential failure mode. What do you think that is? (Hint: consider walking around a corner to a place you have been before.) Do you have ideas for a potential fix? (Hint: this would be a great class project and probably a computer vision conference paper.)
    • Take a few sentences to reflect on your opinion of the central philosophy of the paper (as well as the NoScope suggested reading). While most academic machine learning work attempts to make the most general models possible (general is better!), these works suggest that life can perhaps be better if we embrace the fact that models can still be effective even when they are very specific. Is this just a systems-centric hack? What's your opinion? It might be good to think about your answer in the context of last class' discussion about the challenges of acquiring both good training data and good validation data for models.
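
The core loop of the paper, as I read it, sketched in Python pseudocode (the function names and thresholds are mine, not the authors'): the expensive teacher runs on a trickle of frames, providing the supervision for online student updates, and the sampling rate adapts to how well the student is tracking the teacher.

```python
def online_distillation(stream, student, teacher, period=8):
    # Toy sketch: adaptively distill an expensive teacher (e.g., Mask
    # R-CNN) into a cheap student, online, on a live video stream.
    for t, frame in enumerate(stream):
        yield student.infer(frame)                 # cheap: runs on every frame
        if t % period == 0:                        # trickle of expensive frames
            target = teacher.infer(frame)          # supervision without humans
            loss = student.update(frame, target)   # one online gradient step
            # Adapt the supervision rate: query the teacher less often
            # when the student tracks it well, more often when it doesn't.
            period = min(64, period * 2) if loss < 0.1 else max(1, period // 2)
```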

Other Recommended Readings:

Lecture 11: Video Compression

Post-Lecture Required Reading:

  • Encoding, Fast and Slow: Low-Latency Video Processing Using Thousands of Tiny Threads. Fouladi et al. NSDI 17
    • This paper was one of those papers that made me go, "wow, that's cool, it's amazing someone hasn't thought about that before!" Please consider the following in your reading response:
    • This is mainly an algorithms paper, but it's set in the context of systems thinking about the possibilities of what can be done if an application can rent a large number of cores (almost) instantaneously, and only use each of those cores for a few seconds. The benchmarks in Section 2 (see the paragraph entitled "Cold and warm start") tease out what I mean by "almost instantaneous". It's not directly stated in the paper, but why do you think the authors observe system behavior where not all the cores they request are available immediately?
    • The paper gives a good review of the video encoding process that we discussed in class. To review, which part of the process is the "slow part" that the authors aim to parallelize? Why is it so expensive?
    • One way to use N workers is to simply chop the video into N segments, have each worker serially run a video compression algorithm on its chunk, and then concatenate the resulting videos. Why is this deemed insufficient by the authors? (Hint: compression ratio.)
    • The key aspect of the encoding algorithm is the rebase operation. In short, describe how rebase works. Why is rebasing needed? And why is rebasing fast? (A toy version is sketched after this list.)
    • Do you think the same parallel computing infrastructure used in this paper would be good for reducing the latency of DNN training? Why or why not? Name at least one application we've described in this course that might be a good candidate for this platform.
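
A very rough sketch of the parallel encode + rebase structure, in runnable Python with stub functions (all names here are hypothetical; the real system operates on the VP8 codec's explicit decoder state). Each worker encodes its chunk independently starting from a keyframe; then, in a cheap sequential pass, each chunk's expensive first frame is re-encoded against the previous chunk's final decoder state and the remaining frames are adapted ("rebased") onto it.

```python
def encode_chunk(frames):
    # Stub: encode a chunk in isolation. Its first frame must be encoded
    # as a large, self-contained keyframe; the rest are small interframes.
    return {"frames": frames, "first_is_keyframe": True, "final_state": frames[-1]}

def rebase(chunk, prev_state):
    # Stub for the paper's key trick: re-encode the chunk's first frame
    # as an interframe against prev_state (slow, but only one frame),
    # then cheaply rewrite the remaining interframes to be consistent
    # with the updated state (fast).
    chunk["first_is_keyframe"] = False
    chunk["based_on"] = prev_state
    return chunk

def parallel_encode(chunks):
    encoded = [encode_chunk(c) for c in chunks]  # phase 1: fully parallel
    for i in range(1, len(encoded)):             # phase 2: pipelined stitching
        encoded[i] = rebase(encoded[i], encoded[i - 1]["final_state"])
    return encoded

parallel_encode([[0, 1, 2], [3, 4, 5], [6, 7, 8]])
```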

Other Recommended Readings:

Lecture 12: Additional Video Processing Topics

No required readings for this lecture.

Recommended Readings:

Lecture 13: The Real-Time Graphics Pipeline

No required readings for this lecture.

Recommended Readings:

  • A Trip Down the LOL Graphics Pipeline. A nice introductory blog post from Riot Games that illustrates all the different rendering passes used to construct a League of Legends scene. Note how each of these passes draws geometry under a different graphics pipeline state configuration.
  • A Trip Down the Graphics Pipeline. A much more detailed blog post series by Fabian Giesen describing the Direct3D 10-class pipeline.
  • The Design of the OpenGL Graphics Interface. M. Segal and K. Akeley. [unpublished 1994]
  • The Direct3D 10 System. D. Blythe. SIGGRAPH 2006

Lecture 14: Scheduling The Graphics Pipeline onto GPU Hardware

No required readings for this lecture.

Recommended Readings:

Lecture 15: Domain-Specific Languages for Shading

Pre-Lecture Required Reading:

  • A Language for Shading and Lighting Calculations. P. Hanrahan and J. Lawson. SIGGRAPH 1990
    • This paper presents a domain-specific language for describing shading calculations. For those not familiar with basic rendering algorithms from a class like CS248, before reading this paper you’ll likely need to read through these notes that explain the role of shading and lighting computations in computer graphics. In particular, make sure you understand the rendering equation, which is the fundamental equation for computing how much light bounces off a surface (it is reproduced after this reading list). A shader is a program that computes this value.
    • A big part of a domain-specific language is that it constrains programs to have a certain structure. I’d like you to describe the structure enforced by the RSL domain-specific language.
      • What are surface shaders and what do they compute?
      • What are light shaders and what do they compute?
      • How do surface and light shaders interact through illuminance loops?
      • What is the correspondence between this structure and the rendering equation that is being simulated by the program?
    • Section 3.2 describes the concept of uniform and varying variables. How do these two types of variables differ? And what is the motivation for differentiating between uniform and varying?
    • Sections 4 and 5 are worth reading for those interested (they focus on state management), but we’ll focus the discussion on Sections 1-3.
  • Cg: A System for Programming Graphics Hardware in a C-like Language. W. R. Mark et al. SIGGRAPH 2003
    • This paper is about one of the first general programming languages for GPUs (this was pre-CUDA). However, it came much later than the RenderMan Shading Language from the first paper.
    • I’d claim that the RenderMan Shading Language is indeed a domain-specific language for shading computations. But would you say the same about Cg?
    • The paper describes the thinking behind a number of big decisions made in the design of Cg. I’d like your opinion on what you think is the most interesting design decision the authors made.
    • In your reading, pay close attention to design goals and design constraints. We’ll talk about the implications of these goals and constraints in class. In your writeup, please comment on what you think is the most interesting constraint.
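
For reference while reading the Hanrahan and Lawson paper, here is the rendering equation mentioned above, written in its standard modern form (the notation here is the common one, not necessarily the paper's):

$$ L_o(x, \omega_o) = L_e(x, \omega_o) + \int_{\Omega} f_r(x, \omega_i, \omega_o) \, L_i(x, \omega_i) \, (n \cdot \omega_i) \, d\omega_i $$

In RSL terms: light shaders produce the incoming radiance L_i, surface shaders evaluate the reflection term f_r, and the illuminance loop is the language construct that accumulates the contribution of each light, approximating the integral as a sum over light sources.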

Other Recommended Readings:

Lecture 16: Architecture Support for Ray Tracing

Other Recommended Readings:
