1 - CS231n Resources

CS231n Resources

How To

Student Notes

Student Forums

CS231 Assignment

Similar to CS231n

EECS498 Student Notes

CS294-129 Student Notes

CS498 DL Student Notes

2 - Stanford CS231n 2017 Summary

Stanford CS231n 2017 Summary

Table of contents

Course Info

  • Website: http://cs231n.stanford.edu/
  • Lectures link: https://www.youtube.com/playlist?list=PLC1qU-LWwrF64f4QKQT-Vg5Wr4qEE1Zxk
  • Full syllabus link: http://cs231n.stanford.edu/syllabus.html
  • Assignments solutions: https://github.com/Burton2000/CS231n-2017
  • Number of lectures: 16
  • Course description:
    • Computer Vision has become ubiquitous in our society, with applications in search, image understanding, apps, mapping, medicine, drones, and self-driving cars. Core to many of these applications are visual recognition tasks such as image classification, localization and detection. Recent developments in neural network (aka “deep learning”) approaches have greatly advanced the performance of these state-of-the-art visual recognition systems. This course is a deep dive into details of the deep learning architectures with a focus on learning end-to-end models for these tasks, particularly image classification. During the 10-week course, students will learn to implement, train and debug their own neural networks and gain a detailed understanding of cutting-edge research in computer vision. The final assignment will involve training a multi-million parameter convolutional neural network and applying it on the largest image classification dataset (ImageNet). We will focus on teaching how to set up the problem of image recognition, the learning algorithms (e.g. backpropagation), practical engineering tricks for training and fine-tuning the networks and guide the students through hands-on assignments and a final course project. Much of the background and materials of this course will be drawn from the ImageNet Challenge.

01. Introduction to CNN for visual recognition

  • A brief history of Computer vision starting from the late 1960s to 2017.
  • Computer vision problems include image classification, object localization, object detection, and scene understanding.
  • ImageNet is one of the biggest image classification datasets available right now.
  • Starting in 2012, CNNs (Convolutional Neural Networks) have won the ImageNet competition every year.
  • CNNs were actually invented back in 1998 by Yann LeCun (LeNet).

02. Image classification

  • The image classification problem has many challenges, like illumination changes and viewpoint variation.

  • Image classification can be attempted with K-Nearest Neighbors (KNN), but it solves the problem poorly. The properties of KNN are:

    • The hyperparameters of KNN are k and the distance measure.
    • K is the number of neighbors we compare to.
    • Distance measures include (see the sketch after this list):
      • L2 distance (Euclidean distance)
        • Best for non-coordinate points
      • L1 distance (Manhattan distance)
        • Best for coordinate points
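A minimal numpy sketch of the two distance measures, computing the distance from one test image to every training image (array names are illustrative, not from the course code):

```python
import numpy as np

def l1_distances(X_train, x_test):
    # Manhattan distance: sum of absolute per-pixel differences.
    return np.sum(np.abs(X_train - x_test), axis=1)

def l2_distances(X_train, x_test):
    # Euclidean distance: square root of the summed squared differences.
    return np.sqrt(np.sum((X_train - x_test) ** 2, axis=1))

# X_train: (N, D) flattened training images; x_test: (D,) one flattened test image.
# A KNN classifier takes the labels of the k smallest distances and lets them vote.
```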
  • Hyperparameters can be optimized using cross-validation as follows (in our case we are trying to pick K):

    1. Split your dataset into f folds.

    2. Given candidate hyperparameters:

      • Train your algorithm with f-1 folds and test it on the remaining fold, and repeat this with every fold.
    3. Choose the hyperparameters that give the best validation performance (averaged over all folds). See the sketch below.
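A compact sketch of that procedure for choosing K, assuming a hypothetical `knn_predict(X_train, y_train, X_val, k)` helper and numpy arrays:

```python
import numpy as np

def cross_validate_k(X, y, k_choices, num_folds=5):
    X_folds = np.array_split(X, num_folds)
    y_folds = np.array_split(y, num_folds)
    avg_acc = {}
    for k in k_choices:
        accs = []
        for i in range(num_folds):
            # Fold i is the validation fold; the rest are for training.
            X_tr = np.concatenate(X_folds[:i] + X_folds[i+1:])
            y_tr = np.concatenate(y_folds[:i] + y_folds[i+1:])
            preds = knn_predict(X_tr, y_tr, X_folds[i], k)  # hypothetical helper
            accs.append(np.mean(preds == y_folds[i]))
        avg_acc[k] = np.mean(accs)  # average over all folds
    return max(avg_acc, key=avg_acc.get)  # the k with the best average accuracy
```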

  • A linear SVM classifier is an option for solving the image classification problem, but the curse of dimensionality makes it stop improving at some point.

  • Logistic regression is also a solution for the image classification problem, but image classification is non-linear!

  • Linear classifiers compute the following equation: Y = wX + b

    • For a single class score, the shape of w is the same as x, and b is a scalar.
  • We can append a 1 to the X vector and fold the bias into w so that: Y = wX

    • The shape of x becomes oldX+1 and w grows accordingly.
  • We need a way to get the w's and b's that make the classifier perform best. A small sketch of the score function follows.
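A minimal numpy sketch of the linear score function with the bias trick (the sizes D and C are illustrative):

```python
import numpy as np

D, C = 3072, 10                        # e.g. a flattened 32x32x3 image, 10 classes
x = np.random.randn(D)                 # one input image, flattened
x = np.append(x, 1.0)                  # bias trick: append a constant 1
W = 0.01 * np.random.randn(C, D + 1)   # the last column of W plays the role of b

scores = W.dot(x)                      # one score per class, shape (C,)
```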

03. Loss function and optimization

  • In the last section we talked about the linear classifier, but we didn't discuss how to train the parameters of that model to get the best w's and b's.

  • We need a loss function to measure how good or bad our current parameters are.

  • Loss = L[i] = Li(f(X[i],W), Y[i])
    Loss_for_all = 1/N * Sum(Li(f(X[i],W), Y[i]))      # i.e. the average over the dataset
    
  • Then we find a way to minimize the loss function given some parameters. This is called optimization.

  • Loss function for a linear SVM classifier:

    • L[i] = Sum over all classes except the true class of max(0, s[j] - s[y[i]] + 1)
    • We call this the hinge loss.
    • The hinge loss is zero if the true-class score beats every other class score by at least the margin (1); otherwise each violating class contributes an error.
    • Example:
      • Given this example we want to compute the loss of this image.
      • L = max(0, 437.9 - (-96.8) + 1) + max(0, 61.95 - (-96.8) + 1) = max(0, 535.7) + max(0, 159.75) = 695.45
      • The final loss is 695.45, which is big and reflects that the cat score (currently the lowest of all classes) needs to become the best. We need to minimize that loss.
    • It's OK for the margin to be 1, but it is a hyperparameter too. A vectorized sketch of this loss follows.
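A minimal vectorized sketch of the hinge loss for one example, reproducing the numbers above (function and variable names are illustrative):

```python
import numpy as np

def svm_loss_single(scores, y):
    # scores: (C,) class scores for one example; y: index of the true class.
    margins = np.maximum(0, scores - scores[y] + 1.0)  # margin of 1
    margins[y] = 0                                     # don't count the true class
    return np.sum(margins)

# The cat (true class, index 0) scored lowest, so both other classes violate the margin.
print(svm_loss_single(np.array([-96.8, 437.9, 61.95]), y=0))  # 695.45
```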
  • If your loss function gives you zero, is that set of parameters unique? No, many different parameters achieve the best score (e.g. 2W also gives zero loss).

  • You'll sometimes hear about people instead using the squared hinge loss SVM (or L2-SVM), which penalizes violated margins more strongly (quadratically instead of linearly). The unsquared version is more standard, but on some datasets the squared hinge loss can work better.

  • We add regularization to the loss function so that the discovered model doesn't overfit the data.

  • Loss = L = 1/N * Sum(Li(f(X[i],W),Y[i])) + lambda * R(W)
    
  • Where R is the regularizer and lambda is the regularization strength (a hyperparameter).

  • There are different regularization techniques:

    | Regularizer           | Equation                             | Comments                           |
    | --------------------- | ------------------------------------ | ---------------------------------- |
    | L2                    | R(W) = Sum(W^2)                      | Sum of all the squared weights     |
    | L1                    | R(W) = Sum(abs(W))                   | Sum of absolute values of weights  |
    | Elastic net (L1 + L2) | R(W) = beta * Sum(W^2) + Sum(abs(W)) |                                    |
    | Dropout               | No equation                          |                                    |
  • Regularization prefers smaller Ws over big Ws.

  • L2 regularization is also called weight decay. Biases should not be included in regularization.

  • Softmax loss (like logistic regression, but generalized to more than 2 classes):

    • Softmax function:

      ```python
      A[L] = e^(score[L]) / sum(e^(score[L]), NoOfClasses)
      ```

  • The sum of the output vector equals 1 (it is a probability distribution).

  • Softmax loss:

    ```python
    Loss = -log P(Y = y[i] | X = x[i])
    ```

    • This is minus the log of the probability of the correct class. We want that probability near 1; the minus sign makes the loss small exactly in that case.
    • Softmax loss is also called cross-entropy loss.
  • Consider this numerical problem when you are computing Softmax:

    ```python
    import numpy as np

    f = np.array([123, 456, 789]) # example with 3 classes, each having large scores
    p = np.exp(f) / np.sum(np.exp(f)) # Bad: numeric problem, potential overflow

    # instead: first shift the values of f so that the highest number is 0:
    f -= np.max(f) # f becomes [-666, -333, 0]
    p = np.exp(f) / np.sum(np.exp(f)) # safe to do, gives the correct answer
    ```
  • Optimization:
    • How can we optimize the loss functions we discussed?
    • Strategy one:
      • Sample some random parameters, evaluate the loss for each, and keep the best. But it's a bad idea.
    • Strategy two:
      • Follow the slope.

      • Our goal is to compute the gradient of each parameter we have.

        • Numerical gradient: Approximate, slow, easy to write. (But its useful in debugging.)
        • Analytic gradient: Exact, Fast, Error-prone. (Always used in practice)
      • After we compute the gradients of our parameters, we take a gradient descent step:

        ```python
        W = W - learning_rate * W_grad
        ```

      • The learning_rate is such an important hyperparameter that you should tune it before all the other hyperparameters.
      • Stochastic gradient descent (SGD):
        • Instead of using all the data, use a mini-batch of examples (32/64/128 are commonly used) for faster results. See the sketch below.
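A minimal sketch of the mini-batch SGD loop, assuming X_train/y_train arrays, an initialized W, and a hypothetical `loss_and_gradient` helper that returns the average loss over the batch and the gradient dW:

```python
import numpy as np

batch_size, learning_rate, num_iterations = 128, 1e-3, 1000
for it in range(num_iterations):
    # Sample a random mini-batch of examples and their labels.
    idx = np.random.choice(X_train.shape[0], batch_size)
    X_batch, y_batch = X_train[idx], y_train[idx]

    # Evaluate the loss and gradient on the batch, then step downhill.
    loss, dW = loss_and_gradient(W, X_batch, y_batch)  # hypothetical helper
    W -= learning_rate * dW
```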

04. Introduction to Neural networks

  • Computing the analytic gradient for arbitrary complex functions:
    • What is a computational graph?
      • It represents any function as a graph of nodes (operations).
      • Computational graphs lead us naturally to a technique called back-propagation, which works even with complex models like CNNs and RNNs.
    • Back-propagation simple example:
      • Suppose we have f(x,y,z) = (x+y)z

      • Then the graph can be represented this way:

        ```
        X         
          \
           (+)--> q ---(*)--> f
          /           /
        Y            /
                    /
                   /
        Z---------/
        ```
      • We made an intermediate variable `q` to hold the value of `x+y`
      • Then we have:
        ```python
        q = (x+y)              # dq/dx = 1 , dq/dy = 1
        f = qz                 # df/dq = z , df/dz = q
        ```
      • Then:
        ```python
        df/dq = z
        df/dz = q
        df/dx = df/dq * dq/dx = z * 1 = z       # Chain rule
        df/dy = df/dq * dq/dy = z * 1 = z       # Chain rule
        ```
  • So in a computational graph we call each operation f. For each f we calculate the local gradient, and then during back-propagation we compute the gradients with respect to the loss function using the chain rule.

  • In a computational graph you can split each operation into pieces as simple as you want, at the cost of many more nodes. If you want fewer, larger nodes, make sure you can compute the local gradient of each node.

  • A bigger example:

    • Hint: when a node's output branches to two nodes, its gradient during back-propagation is the sum of the two incoming derivatives.
  • Modularized implementation: forward/backward API (example: a multiply gate):

  ```python
  class MultiplyGate(object):
    """
    x, y are scalars
    """
    def forward(self, x, y):
      z = x * y
      self.x = x  # Cache
      self.y = y  # Cache
      # We cache x and y because the local gradients depend on them.
      return z
    def backward(self, dz):
      dx = self.y * dz    # local gradient dz/dx = y, times the upstream gradient dz
      dy = self.x * dz    # local gradient dz/dy = x, times the upstream gradient dz
      return [dx, dy]
  ```
  • If you look at a deep learning framework you will find it follows this modularized implementation, where each class defines a forward and a backward. For example:
    • Multiplication
    • Max
    • Plus
    • Minus
    • Sigmoid
    • Convolution
  • So to define a neural network as a function:
    • (Before) Linear score function: f = Wx
    • (Now) 2-layer neural network: f = W2*max(0, W1*x)
      • Where max is the RELU non-linear function
    • (Now) 3-layer neural network: f = W3*max(0, W2*max(0, W1*x))
    • And so on..
  • A neural network is a stack of simple operations that compose into a complex function. A numpy sketch of the 2-layer network follows.
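A minimal numpy sketch of the 2-layer network f = W2*max(0, W1*x) above (the sizes are illustrative):

```python
import numpy as np

D, H, C = 3072, 100, 10            # input dim, hidden units, classes (illustrative)
W1 = 0.01 * np.random.randn(H, D)
W2 = 0.01 * np.random.randn(C, H)

x = np.random.randn(D)
h = np.maximum(0, W1.dot(x))       # RELU non-linearity
f = W2.dot(h)                      # class scores, shape (C,)
```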

05. Convolutional neural networks (CNNs)

  • Neural networks history:
    • The first perceptron machine was developed by Frank Rosenblatt in 1957. It was used to recognize letters of the alphabet. Back-propagation hadn't been developed yet.
    • Adaline/Madaline, stacked multilayer linear units, were developed by Widrow and Hoff in 1960. Back-propagation still hadn't been developed.
    • Back-propagation was developed in 1986 by Rumelhart.
    • Then there was a period in which little new happened with NNs, because of limited computing resources and data.
    • In 2006 Hinton released a paper showing that we can train a deep neural network by using Restricted Boltzmann Machines to initialize the weights and then applying back-propagation.
    • The first strong results came in 2012 from Hinton's group: in speech recognition, and with AlexNet, the convolutional neural network that won ImageNet in 2012.
    • After that, NNs became widely used in various applications.
  • Convolutional neural networks history:
    • Hubel & Wiesel's experiments on the cat visual cortex (1959 to 1968) found a topographical mapping in the cortex, and that the neurons have a hierarchical organization from simple to complex cells.
    • In 1998, Yann LeCun published the paper "Gradient-based learning applied to document recognition", which introduced convolutional neural networks. It was good at recognizing zip-code digits but couldn't scale to more complex problems.
    • In 2012 AlexNet used an architecture similar to LeCun's and won the ImageNet challenge. The difference from 1998 is that we now have large datasets, and the power of GPUs solved a lot of performance problems.
    • Starting from 2012 there are CNN that are used for various tasks (Here are some applications):
      • Image classification.
      • Image retrieval.
        • Extracting features using a NN and then do a similarity matching.
      • Object detection.
      • Segmentation.
        • Each pixel in an image takes a label.
      • Face recognition.
      • Pose recognition.
      • Medical images.
      • Playing Atari games with reinforcement learning.
      • Galaxies classification.
      • Street signs recognition.
      • Image captioning.
      • Deep dream.
  • ConvNet architectures make the explicit assumption that the inputs are images, which allows us to encode certain properties into the architecture.
  • There are a few distinct types of Layers in ConvNet (e.g. CONV/FC/RELU/POOL are by far the most popular)
  • Each Layer may or may not have parameters (e.g. CONV/FC do, RELU/POOL don’t)
  • Each Layer may or may not have additional hyperparameters (e.g. CONV/FC/POOL do, RELU doesn’t)
  • How do convolutional neural networks work?
    • A fully connected layer is a layer in which every neuron is connected to all outputs of the previous layer. Sometimes we call it a dense layer.
      • If the input shape is (X, M), the weight shape for this layer will be (NoOfHiddenNeurons, X)
    • A convolution layer is a layer that preserves the spatial structure of the input, using a filter that slides over the whole image.
      • We do this with a dot product: W.T*X + b. This equation uses broadcasting.
      • So we need to learn the values of W and b.
      • We usually stretch the filter (W) into a vector, not a matrix, when taking the dot product.
    • We call the output of the convolution an activation map, and we want multiple activation maps.
      • Example if we have 6 filters, here are the shapes:
        • Input image (32,32,3)
        • filter size (5,5,3)
          • We apply 6 filters. The depth must be three because the input map has depth of three.
        • Output of Conv. (28,28,6)
          • if one filter it will be (28,28,1)
        • After RELU (28,28,6)
        • Another CONV layer with ten (5,5,6) filters
        • Output of Conv. (24,24,10)
    • It turns out that ConvNets learn low-level features in the first layers, then mid-level features, and then high-level features.
    • After the ConvNet layers we can attach a linear classifier for a classification task.
    • In convolutional neural networks we usually have some (CONV ==> RELU) blocks and then apply a pooling operation to downsample the size of the activations.
  • What is the stride when we are doing convolution?
    • While doing a conv layer we have many choices to make regarding the stride. Examples follow.
    • Stride is the step size while sliding. By default it is 1.
    • Given a matrix with shape of (7,7) and a filter with shape (3,3):
      • If stride is 1 then the output shape will be (5,5) # 2 are dropped
      • If stride is 2 then the output shape will be (3,3) # 4 are dropped
      • If stride is 3 it doesn’t work.
    • A general formula would be ((N-F)/stride +1)
      • If stride is 1 then O = ((7-3)/1)+1 = 4 + 1 = 5
      • If stride is 2 then O = ((7-3)/2)+1 = 2 + 1 = 3
      • If stride is 3 then O = ((7-3)/3)+1 = 1.33 + 1 = 2.33 # doesn't work
  • In practice it's common to zero-pad the border. # Padding from both sides.
    • Given a stride of 1, it's common to pad by (F-1)/2, where F is the filter size:
      • Example F = 3 ==> zero pad with 1
      • Example F = 5 ==> zero pad with 2
    • If we pad this way, we call it a "same" convolution.
    • Adding zeros introduces artificial features at the edges; that's why other padding techniques exist (e.g. filling the border with non-zero values), but in practice zeros work!
    • We do this to maintain the full size of the input. If we didn't, the input would shrink too fast through the layers and we would lose a lot of information.
  • Example:
    • If we have an input of shape (32,32,3) and ten filters of shape (5,5) with stride 1 and pad 2:
      • Output size will be (32,32,10) # We maintain the size.
    • Number of parameters per filter = 5*5*3 + 1 = 76
    • All parameters = 76 * 10 = 760
  • The number of filters is usually a power of 2. # To vectorize well.
  • So here are the hyperparameters of the CONV layer (see the helper sketch at the end of this section):
    • Number of filters K.
      • Usually a power of 2.
    • Spatial extent (filter size) F.
      • 3, 5, 7 ….
    • The stride S.
      • Usually 1 or 2 (a big stride downsamples the input, an alternative to pooling).
    • Amount of zero padding P.
      • If we want the output shape to equal the input shape, choose it based on F: if F is 3 pad 1, if F is 5 pad 2, and so on.
  • Pooling makes the representation smaller and more manageable.
  • Pooling operates over each activation map independently.
  • An example of pooling is max pooling.
    • The parameters of max pooling are the size of the filter and the stride.
      • Example: 2x2 with stride 2 # Usually the two parameters match: 2, 2
  • Another example of pooling is average pooling.
    • In this case it might be learnable.
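With padding, the output-size formula generalizes to (N - F + 2P)/S + 1. A tiny hypothetical helper to check the layer arithmetic from the examples above:

```python
def conv_output_size(N, F, S=1, P=0):
    """Spatial output size of a conv/pool layer: input N, filter F, stride S, zero pad P."""
    out = (N - F + 2 * P) / S + 1
    assert out == int(out), "this filter/stride/pad combination doesn't fit the input"
    return int(out)

print(conv_output_size(7, 3, S=1))        # 5
print(conv_output_size(7, 3, S=2))        # 3
print(conv_output_size(32, 5, S=1, P=2))  # 32, the "same" convolution case
```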

06. Training neural networks I

  • As a review, here are the steps of the mini-batch stochastic gradient descent algorithm:

    • Loop:

      1. Sample a batch of data.
      2. Forward prop it through the graph (network) and get loss.
      3. Backprop to calculate the gradients.
      4. Update the parameters using the gradients.
  • Activation functions:

    • Choices of activation function include Sigmoid, tanh, RELU, Leaky RELU, Maxout, and ELU.

    • Sigmoid:

      • Squashes numbers into the range [0,1]
      • Historically interpreted as a firing rate, like neurons in the human brain.
      • Sigmoid(x) = 1 / (1 + e^-x)
      • Problems with sigmoid:
        • Saturated neurons kill the gradients.
          • Gradients are near 0 for big or small inputs, which kills the updates if the network is deep.
        • Not zero-centered.
          • Its outputs are all positive, so it doesn't produce zero-mean data for the next layer.
        • exp() is a bit compute-expensive.
          • Just a minor point; deep learning has much more expensive operations, like convolution.
    • Tanh:

      • Squashes numbers into the range [-1,1]
      • Zero-centered.
      • Saturated neurons still "kill" the gradients.
      • Tanh(x) is the equation.
      • Proposed by Yann LeCun in 1991.
    • RELU (Rectified linear unit):

      • RELU(x) = max(0,x)
      • Doesn’t kill the gradients.
        • Only small values that are killed. Killed the gradient in the half
      • Computationally efficient.
      • Converges much faster than Sigmoid and Tanh (6x)
      • More biologically plausible than sigmoid.
      • Proposed by Alex Krizhevsky in 2012 Toronto university. (AlexNet)
      • Problems:
        • Not zero centered.
      • If weights aren’t initialized good, maybe 75% of the neurons will be dead and thats a waste computation. But its still works. This is an active area of research to optimize this.
      • To solve the issue mentioned above, people might initialize all the biases by 0.01
    • Leaky RELU:

      • leaky_RELU(x) = max(0.01x, x)
      • Doesn't kill the gradients on either side.
      • Computationally efficient.
      • Converges much faster than Sigmoid and Tanh (roughly 6x).
      • Will not "die".
      • PRELU replaces the 0.01 with a parameter alpha that is learned.
    • Exponential linear units (ELU):

      ```
      ELU(x) = { x                        if x > 0
               { alpha * (exp(x) - 1)     if x <= 0
      # alpha is a hyperparameter (commonly set to 1)
      ```

      • It has all the benefits of RELU.
      • Closer to zero-mean outputs, and adds some robustness to noise.
      • Problems:
        • exp() is a bit compute-expensive.
  • Maxout activations:
    • maxout(x) = max(w1.T*x + b1, w2.T*x + b2)
    • Generalizes RELU and Leaky RELU.
    • Doesn't die!
    • Problems:
      • Doubles the number of parameters per neuron.
  • In practice:
    • Use RELU. Be careful with your learning rates.
    • Try out Leaky RELU/Maxout/ELU.
    • Try out tanh but don't expect much.
    • Don't use sigmoid!
  • Data preprocessing:
    • Normalize the data:

    ```python
    # Zero-center the data (calculate the mean for every feature).
    # One of the reasons we do this is that we want the data to span positive and
    # negative values, not be all positive or all negative.
    X -= np.mean(X, axis=0)

    # Then divide by the standard deviation. Hint: for images we don't do this step.
    X /= np.std(X, axis=0)
    ```
    
  • To normalize images:

    • Subtract the mean image (e.g. AlexNet).
      • The mean image's shape is the same as the input images'.
    • Or subtract the per-channel mean.
      • Meaning: calculate the mean of each channel over all images. The result's shape is 3 (3 channels).
  • Weight initialization:

    • What happens when we initialize all Ws with zeros?

      • All the neurons will do exactly the same thing. They will have the same gradient and get the same update.
      • More generally, the same happens whenever all W's of a specific layer are equal.
    • The first idea is to initialize the W's with small random numbers:

    ```python
    W = 0.01 * np.random.randn(D, H)
    # Works OK for small networks but causes problems with deeper networks!
    ```

    • With small random weights, the standard deviation of the activations shrinks toward zero in the deeper layers, so the gradients vanish sooner in deep networks.

    ```python
    W = 1 * np.random.randn(D, H)
    # Also problematic for deeper networks!
    ```

    • With big random weights, the network explodes with big numbers!
  • Xavier initialization:

    ```python
    W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)
    ```

    • It works because we want the variance of the output to equal the variance of the input.
    • But it has an issue: it breaks when you are using RELU.
  • He initialization (solution for the RELU issue):

    ```python
    W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in / 2)
    ```

    • Solves the issue with RELU; it's recommended when you are using RELU.
  • Proper initialization is an active area of research.

  • Batch normalization:

    • A technique that provides any layer in a neural network with inputs that have zero mean and unit variance.

    • It speeds up training. You want to use it a lot.

      • Introduced by Sergey Ioffe and Christian Szegedy in 2015.
    • We make the activations in each layer unit Gaussian by calculating the mean and the variance over the batch.

    • Usually inserted after fully connected or convolutional layers, and before the nonlinearity.

    • Steps (for each output of a layer; see the sketch below):
      1. First we compute the mean and variance of the batch for each feature.
      2. We normalize by subtracting the mean and dividing by sqrt(variance + epsilon).

        • epsilon is there so we never divide by zero.
      3. Then we apply learnable scale and shift variables: Result = gamma * normalizedX + beta

        • gamma and beta are learnable parameters.
        • This basically makes it possible for a layer to say "Hey!! I don't want zero mean/unit variance input, give me back the raw input - it's better for me."
        • I.e. the layer can shift and scale to whatever distribution it wants, not just keep the normalized one!
    • The algorithm makes each layer flexible (it chooses which distribution it wants).

    • We initialize the BatchNorm parameters to transform the input to zero mean/unit variance distributions, but during training they can learn that another distribution might be better.

    • During training we also maintain a running (exponentially weighted) mean and variance for each layer; these global statistics are used at test time.

    • Benefits of Batch Normalization:

      • Networks train faster.
      • Allows higher learning rates.
      • helps reduce the sensitivity to the initial starting weights.
      • Makes more activation functions viable.
      • Provides some regularization.
        • Because we are calculating mean and variance for each batch that gives a slight regularization effect.
    • In conv layers, we will have one variance and one mean per activation map.

    • Batch normalization has worked best for CONV and regular deep NNs, but for recurrent NNs and reinforcement learning it's still an active research area.

      • It's challenging in reinforcement learning because the batch is small.
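A minimal numpy sketch of the training-time batch-norm forward pass described above (names are illustrative; the running statistics for test time are omitted):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    # x: (N, D) batch of activations; gamma, beta: (D,) learnable scale and shift.
    mean = x.mean(axis=0)                    # per-feature batch mean
    var = x.var(axis=0)                      # per-feature batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)  # zero mean / unit variance
    return gamma * x_hat + beta              # learnable scale and shift
```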
  • Babysitting the learning process

    1. Preprocessing of data.

    2. Choose the architecture.

    3. Make a forward pass and check the loss (Disable regularization). Check if the loss is reasonable.

    4. Add regularization, the loss should go up!

    5. Disable the regularization again, take a small subset of the data, and train until you reach zero loss.

      • You should be able to overfit perfectly on small datasets.
    6. Take your full training data with small regularization, then try some values of the learning rate.

      • If the loss barely changes, the learning rate is too small.
      • If you get NaN, your NN exploded and the learning rate is too high.
      • Get your learning rate range by trying the minimum value that changes the loss and the maximum value that doesn't explode the network.
    7. Do Hyperparameters optimization to get the best hyperparameters values.

  • Hyperparameter Optimization

    • Try a cross-validation strategy.
      • Run with a few epochs, and try to narrow down the ranges.
    • It's best to optimize in log space.
    • Adjust your ranges and try again.
    • It's better to use random search instead of grid search (in log space).

07. Training neural networks II

  • Optimization algorithms:
    • Problems with stochastic gradient descent:
      • If the loss changes quickly in one direction and slowly in another (picture just two variables), you get very slow progress along the shallow dimension and jitter along the steep one. Real NNs have many more parameters, so the problem is worse.
      • Local minima and saddle points:
        • If SGD reaches a local minimum, we get stuck at that point because the gradient is zero.
        • At saddle points the gradient is also zero, so we get stuck there too.
        • A saddle point means that at some point:
          • Some gradient directions push the loss up.
          • Some gradient directions push the loss down.
          • This happens more in high dimensions (100 million dimensions, for example).
        • For deep NNs the problem is more about saddle points than local minima, because deep NNs have very high-dimensional parameter spaces.
        • Mini-batch gradients are noisy because they are estimated from a subset of the data, not the whole dataset.
    • SGD + momentum:
      • Build up velocity as a running mean of gradients:

    ```python
    # Compute an exponentially weighted average of the gradients. rho is typically in [0.9, 0.99].
    v[t+1] = rho * v[t] + dx
    x[t+1] = x[t] - learning_rate * v[t+1]
    ```
    • v[0] is zero.
    • Helps with the saddle point and local minimum problems (the velocity carries us through flat regions).
    • It can overshoot the minimum and then come back to it.
  • Nesterov momentum:

    ```python
    dx = compute_gradient(x)
    old_v = v
    v = rho * v - learning_rate * dx
    x += -rho * old_v + (1 + rho) * v
    ```

    • Doesn't overshoot as much, but is slower than SGD + momentum.
  • AdaGrad

    ```python
    grad_squared = 0
    while True:
      dx = compute_gradient(x)

      # Here is a problem: grad_squared is never decayed (it keeps growing),
      # so the effective learning rate shrinks toward zero.
      grad_squared += dx * dx

      x -= (learning_rate * dx) / (np.sqrt(grad_squared) + 1e-7)
    ```
  • RMSProp

    ```python
    grad_squared = 0
    while True:
      dx = compute_gradient(x)

      # Fixes AdaGrad's problem by decaying the accumulated squared gradient.
      grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dx * dx

      x -= (learning_rate * dx) / (np.sqrt(grad_squared) + 1e-7)
    ```

    • People use this instead of AdaGrad.
  • Adam
    • Combines the momentum (first moment) and RMSProp (second moment) ideas.
    • It needs a bias correction to fix the first steps, since the moment estimates start at zero. A sketch follows.
    • It is the best technique so far; it runs well on a lot of problems.
    • beta1 = 0.9, beta2 = 0.999, and learning_rate = 1e-3 or 5e-4 is a great starting point for many models!
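A minimal sketch of the full Adam update with bias correction, in the same pseudocode style as the blocks above (x and compute_gradient are assumed to exist):

```python
import numpy as np

beta1, beta2, learning_rate = 0.9, 0.999, 1e-3
num_iterations = 1000
first_moment, second_moment = 0, 0
for t in range(1, num_iterations + 1):
    dx = compute_gradient(x)
    first_moment = beta1 * first_moment + (1 - beta1) * dx           # momentum part
    second_moment = beta2 * second_moment + (1 - beta2) * dx * dx    # RMSProp part
    first_unbias = first_moment / (1 - beta1 ** t)                   # bias correction,
    second_unbias = second_moment / (1 - beta2 ** t)                 # since moments start at 0
    x -= learning_rate * first_unbias / (np.sqrt(second_unbias) + 1e-7)
```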
  • Learning rate decay
    • E.g. decay the learning rate by half every few epochs.
    • This helps the loss settle instead of bouncing around.
    • Learning rate decay is common with SGD+momentum but less common with Adam.
    • Don't use learning rate decay from the start when choosing your hyperparameters. Train without it first, then check whether you need decay.
  • All the algorithms discussed above are first-order optimization methods.
  • Second-order optimization
    • Uses the gradient and the Hessian to form a quadratic approximation.
    • Steps to the minimum of the approximation.
    • What is nice about this update?
      • It doesn't have a learning rate in some versions.
    • But it's impractical for deep learning:
      • The Hessian has O(N^2) elements.
      • Inverting it takes O(N^3).
    • L-BFGS is a second-order optimization method:
      • Works with full-batch optimization but not with mini-batches.
  • In practice, first use Adam; if that doesn't work, try L-BFGS.
  • Some say many of the famous deep architectures were trained with SGD + Nesterov momentum.
  • Regularization
    • So far we have talked about reducing the training error, but what we care about most is how our model handles unseen data!
    • What if the gap between the error on the training data and the validation data is too large?
    • This problem is called high variance (overfitting).
    • Model ensembles:
      • Algorithm:
        • Train multiple independent models of the same architecture with different initializations.
        • At test time, average their results.
      • It can get you an extra 2% performance.
      • It reduces the generalization error.
      • You can also ensemble snapshots of your network taken during training and average their results.
    • Regularization attacks the high variance problem. We have talked about L1 and L2 regularization.
    • Some regularization techniques are designed specifically for NNs and can do better.
    • Dropout:
      • In each forward pass, randomly set some neurons to zero. The probability of dropping is a hyperparameter; it is 0.5 in almost all cases.
      • So you choose some activations and set them to zero.
      • It works because:
        • It forces the network to have redundant representations; it prevents co-adaptation of features!
        • If you think about it, it is like ensembling many sub-networks within the same model!
      • At test time we can multiply each dropped-out layer's activations by the dropout probability.
      • Alternatively, with "inverted dropout" we scale at training time instead, so at test time we don't multiply anything. A sketch follows.
      • With dropout, training takes more time.
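A minimal sketch of inverted dropout on one hidden layer, in the style of the course notes (W1, b1, and the input X are assumed to exist):

```python
import numpy as np

p = 0.5  # probability of keeping a unit active

def train_step(X):
    H1 = np.maximum(0, np.dot(W1, X) + b1)    # hidden layer with RELU
    U1 = (np.random.rand(*H1.shape) < p) / p  # dropout mask, rescaled by 1/p at train time
    H1 *= U1                                  # drop (and rescale) activations
    return H1

def predict(X):
    # Test time: no mask and no extra scaling needed, thanks to the 1/p above.
    return np.maximum(0, np.dot(W1, X) + b1)
```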
    • Data augmentation:
      • Another technique that acts as regularization.
      • Change the data!
      • For example, flip the image, or rotate it.
      • Example in ResNet:
        • Training: sample random crops and scales:

          1. Pick a random L in the range [256, 480].
          2. Resize the training image so its short side = L.
          3. Sample a random 224x224 patch.
        • Testing: average over a fixed set of crops:
          1. Resize the image at 5 scales: {224, 256, 384, 480, 640}.
          2. For each size, use 10 224x224 crops: 4 corners + center, plus flips.

        • Apply color jitter or PCA-based color augmentation.

        • Translation, rotation, stretching.

    • DropConnect
      • Like the dropout idea, it acts as regularization.
      • Instead of dropping activations, we randomly zero out weights.
    • Fractional max pooling
      • A cool regularization idea, not commonly used.
      • Randomize the regions over which we pool.
    • Stochastic depth
      • A newer idea.
      • Eliminate whole layers instead of neurons.
      • Has a similar effect to dropout, applied at the layer level.
  • Transfer learning:
    • Sometimes your model overfits because your dataset is small, not because regularization is weak.

    • You need a lot of data if you want to train/use CNNs from scratch.

    • Steps of transfer learning:

      1. Train on a big dataset that has features in common with your dataset. This is called pretraining.
      2. Freeze all the layers except the last one and feed your small dataset in, learning only the last layer.
      3. You can also fine-tune more than the last layer; how many layers to retrain depends on the amount of data you have.
    • Guide to using transfer learning:
    |                         | Very similar dataset                      | Very different dataset                                              |
    | ----------------------- | ----------------------------------------- | ------------------------------------------------------------------- |
    | **very little data**    | Use a linear classifier on the top layer  | You're in trouble... Try a linear classifier from different stages   |
    | **quite a lot of data** | Finetune a few layers                     | Finetune a larger number of layers                                   |
  • Transfer learning is the norm, not the exception.

08. Deep learning software

  • This section changes a lot every year in CS231n due to rapid changes in deep learning software.
  • CPU vs GPU
    • The GPU (graphics card) was developed to render graphics, for playing games, 3D media, etc.
      • NVIDIA vs AMD
        • Deep learning favors NVIDIA over AMD GPUs because NVIDIA pushes deep learning research forward and makes its architectures more suitable for deep learning.
    • A CPU has fewer cores, but each core is much faster and more capable: great for sequential tasks. GPUs have many more cores, but each core is much slower and "dumber": great for parallel tasks.
    • GPU cores need to work together, and the GPU has its own memory.
    • Matrix multiplication is one of the operations best suited for GPUs: it has MxN independent operations that can be done in parallel.
    • The convolution operation can also be parallelized because it consists of independent operations.
    • Frameworks for programming GPUs:
      • CUDA (NVIDIA only)
        • Write C-like code that runs directly on the GPU.
        • It's hard to write well-optimized code that runs on the GPU yourself. That's why NVIDIA provides higher-level APIs.
        • Higher-level APIs: cuBLAS, cuDNN, etc.
        • cuDNN implements backprop, convolution, recurrent layers, and a lot more for you!
        • In practice you won't write parallel code yourself. You will use code implemented and optimized by others!
      • OpenCL
        • Similar to CUDA, but runs on any GPU.
        • Usually slower.
        • Doesn't have much support yet from deep learning frameworks.
    • There are a lot of courses for learning parallel programming.
    • If you aren't careful, training can bottleneck on reading data and transferring it to the GPU. The solutions are:
      • Read all the data into RAM. # If possible
      • Use an SSD instead of an HDD.
      • Use multiple CPU threads to prefetch data!
        • While the GPU is computing, a CPU thread fetches the data for you.
        • A lot of frameworks implement this for you because it's a bit painful!
  • Deep learning Frameworks
    • It's a super fast-moving field!
    • Currently available frameworks:
      • Tensorflow (Google)
      • Caffe (UC Berkeley)
      • Caffe2 (Facebook)
      • Torch (NYU / Facebook)
      • PyTorch (Facebook)
      • Theano (U Montreal)
      • Paddle (Baidu)
      • CNTK (Microsoft)
      • MXNet (Amazon)
    • The instructor thinks that you should focus on Tensorflow and PyTorch.
    • The point of deep learning frameworks:
      • Easily build big computational graphs.
      • Easily compute gradients in computational graphs.
      • Run it efficiently on GPU (cuDNN - cuBLAS)
    • NumPy doesn't run on the GPU.
    • Most of the frameworks try to look like NumPy in the forward pass, and then they compute the gradients for you.
  • Tensorflow (Google)
    • The code has two parts:

      1. Define the computational graph.
      2. Run the graph and reuse it many times.
    • Tensorflow uses a static graph architecture.

    • TensorFlow variables live in the graph, while placeholders are fed on each run.

    • A global initializer function initializes the variables that live in the graph.

    • Use predefined optimizers and losses.

    • You can build full layers with the tf.layers.dense function. A minimal sketch of this workflow follows.
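A minimal sketch of the two-part TensorFlow (1.x, the API this 2017 course used) workflow; the shapes are illustrative:

```python
import numpy as np
import tensorflow as tf

# Part 1: define the computational graph (nothing is computed yet).
x = tf.placeholder(tf.float32, shape=(None, 3))  # fed on each run
w = tf.Variable(tf.random_normal((3, 1)))        # lives in the graph
y = tf.matmul(x, w)

# Part 2: run the graph, reusing it many times.
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())  # the global initializer
    out = sess.run(y, feed_dict={x: np.random.randn(4, 3)})
```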

    • Keras (high-level wrapper):

      • Keras is a layer on top of TensorFlow that makes common things easy to do.
      • So popular!
      • It trains a full deep NN in a few lines of code.
    • There are a lot of high-level wrappers:

      • Keras
      • TFLearn
      • TensorLayer
      • tf.layers #Ships with tensorflow
      • tf-Slim #Ships with tensorflow
      • tf.contrib.learn #Ships with tensorflow
      • Sonnet # New from deep mind
    • TensorFlow has pretrained models that you can use for transfer learning.

    • TensorBoard adds logging to record losses and stats. Run its server and get pretty graphs!

    • It has distributed code if you want to split your graph across several nodes.

    • TensorFlow was actually inspired by Theano. It has the same ideas and structure.

  • PyTorch (Facebook)
    • Has three layers of abstraction:
      • Tensor: an ndarray that runs on the GPU # like a numpy array
      • Variable: a node in a computational graph; stores data and gradient # like TensorFlow's Tensor/Variable/Placeholder
      • Module: a NN layer; may store state or learnable weights # like tf.layers in TensorFlow
    • In PyTorch the graph is built in the same loop you are executing, which makes debugging easier. This is called a dynamic graph; see the sketch after this list.
    • In PyTorch you can define your own autograd functions by writing forward and backward over tensors. Most of the time they are already implemented for you.
    • torch.nn is a high-level API like Keras in TensorFlow. It lets you build models from predefined layers.
      • You can define your own nn module!
    • PyTorch also contains optimizers, like TensorFlow.
    • It contains a DataLoader that wraps a Dataset and provides minibatching, shuffling, and multithreading.
    • PyTorch contains the best and super easy-to-use pretrained models.
    • PyTorch contains Visdom, which is like TensorBoard, but TensorBoard seems to be more powerful.
    • PyTorch is new and still evolving compared to Torch; it's still in a beta state.
    • PyTorch is best for research.
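A minimal sketch of the define-by-run (dynamic graph) style, using the 2017-era torch.autograd.Variable API the course showed; the shapes are illustrative:

```python
import torch
from torch.autograd import Variable

x = Variable(torch.randn(4, 3))                      # data
w = Variable(torch.randn(3, 1), requires_grad=True)  # learnable weight

y = x.mm(w).sum()  # the graph is built while this line executes
y.backward()       # backprop through the graph that was just built
print(w.grad)      # gradient of y with respect to w
```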
  • TensorFlow builds the graph once, then runs it many times (called a static graph).
  • In each PyTorch iteration we build a new graph (called a dynamic graph).
  • Static vs dynamic graphs:
    • Optimization:
      • With static graphs, the framework can optimize the graph for you before it runs.
    • Serialization:
      • Static: once the graph is built, you can serialize it and run it without the code that built it, e.g. use the graph from C++.
      • Dynamic: you always need to keep the code around.
    • Conditionals:
      • Easier in dynamic graphs, more complicated in static graphs.
    • Loops:
      • Easier in dynamic graphs, more complicated in static graphs.
  • TensorFlow Fold makes dynamic graphs easier in TensorFlow through dynamic batching.
  • Dynamic graph applications include recurrent networks and recursive networks.
  • Caffe2 uses static graphs and can train models in Python; it also works on iOS and Android.
  • TensorFlow/Caffe2 are used a lot in production, especially on mobile.

09. CNN architectures

  • This section talks about the famous CNN architectures, focusing on those that have won the ImageNet competition since 2012.
  • These architectures include AlexNet, VGG, GoogLeNet, and ResNet.
  • We will also discuss some other interesting architectures as we go.
  • The first ConvNet was LeNet-5, by Yann LeCun in 1998:
    • Architecture: CONV-POOL-CONV-POOL-FC-FC-FC
    • Each conv filter was 5x5, applied at stride 1.
    • Each pool was 2x2, applied at stride 2.
    • It was useful for digit recognition.
    • In particular, it carried the insight that image features are distributed across the entire image, and that convolutions with learnable parameters are an effective way to extract similar features at multiple locations with few parameters.
    • It contains exactly 5 layers.
  • In 2010 Dan Claudiu Ciresan and Jurgen Schmidhuber published one of the very first GPU implementations of neural nets. The implementation had both forward and backward passes running on an NVIDIA GTX 280 graphics processor, for networks of up to 9 layers.
  • AlexNet (2012):
    • The ConvNet that started the revolution and won ImageNet in 2012.
    • Architecture: CONV1-MAXPOOL1-NORM1-CONV2-MAXPOOL2-NORM2-CONV3-CONV4-CONV5-MAXPOOL3-FC6-FC7-FC8
    • Contains exactly 8 layers: the first 5 are convolutional and the last 3 are fully connected.
    • AlexNet's error was 16.4%.
    • For example, if the input is 227 x 227 x 3, these are the output shapes at each layer:
      • CONV1 (96 11 x 11 filters at stride 4, pad 0)
        • Output shape (55,55,96), Number of weights are (11*11*3*96)+96 = 34944
      • MAXPOOL1 (3 x 3 filters applied at stride 2)
        • Output shape (27,27,96), No Weights
      • NORM1
        • Output shape (27,27,96), We don’t do this any more
      • CONV2 (256 5 x 5 filters at stride 1, pad 2)
      • MAXPOOL2 (3 x 3 filters at stride 2)
      • NORM2
      • CONV3 (384 3 x 3 filters at stride 1, pad 1)
      • CONV4 (384 3 x 3 filters at stride 1, pad 1)
      • CONV5 (256 3 x 3 filters at stride 1, pad 1)
      • MAXPOOL3 (3 x 3 filters at stride 2)
        • Output shape (6,6,256)
      • FC6 (4096)
      • FC7 (4096)
      • FC8 (1000 neurons for class score)
    • Some other details:
      • First use of RELU.
      • Norm layers but not used any more.
      • heavy data augmentation
      • Dropout 0.5
      • batch size 128
      • SGD momentum 0.9
      • Learning rate 1e-2, reduced by a factor of 10 at certain points during training
      • 7 CNN ensembles!
    • AlexNet was trained on GTX 580 GPUs with only 3 GB of memory, which wasn't enough to train on one device, so the feature maps were split in half across two GPUs. The first AlexNet was distributed!
    • It's still used for transfer learning in a lot of tasks.
    • The total number of parameters is 60 million.
  • ZFNet (2013)
    • Won in 2013 with error 11.7%
    • It has the same general structure as AlexNet, but with slightly different hyperparameters tuned for better output.
    • Also contains 8 layers.
    • AlexNet but:
      • CONV1: change from (11 x 11 stride 4) to (7 x 7 stride 2)
      • CONV3,4,5: instead of 384, 384, 256 filters use 512, 1024, 512
  • OverFeat (2013)
    • Won the localization in imageNet in 2013
    • From the paper: "We show how a multiscale and sliding window approach can be efficiently implemented within a ConvNet. We also introduce a novel deep learning approach to localization by learning to predict object boundaries."
  • VGGNet (2014) (Oxford)
    • A deeper network with more layers.
    • Contains 19 layers (VGG19).
    • Runner-up of ILSVRC 2014 (GoogLeNet won), with error 7.3%.
    • Smaller filters with deeper layers.
    • The great advantage of VGG was the insight that multiple 3 x 3 convolutions in sequence can emulate the effect of larger receptive fields, for example 5 x 5 and 7 x 7.
    • It uses the simple 3 x 3 conv all through the network.
      • Three stacked (3 x 3) conv layers have the same receptive field as one 7 x 7 layer.
    • The architecture is several CONV layers followed by a POOL layer, repeated about 5 times, then the fully connected layers.
    • It needs a total memory of 96MB per image for the forward pass alone!
      • Most of the memory is in the early layers.
    • The total number of parameters is 138 million.
      • Most of the parameters are in the fully connected layers.
    • Training details are similar to AlexNet's, e.g. using momentum and dropout.
    • VGG19 is an upgrade of VGG16 that is slightly better but uses more memory.
  • GoogLeNet (2014)
    • A deeper network with more layers.
    • Contains 22 layers.
    • Built from efficient "Inception" modules.
    • Only 5 million parameters! 12x fewer than AlexNet.
    • Won ILSVRC 2014 (VGGNet was runner-up), with error 6.7%.
    • Inception module:
      • Design a good local network topology (a "network within a network", NiN) and then stack these modules on top of each other.
      • It consists of:
        • Parallel filter operations applied to the input from the previous layer:
          • Multiple conv filter sizes (1 x 1, 3 x 3, 5 x 5)
            • With padding to maintain the spatial size.
          • A pooling operation (max pooling)
            • With padding to maintain the spatial size.
        • All filter outputs are then concatenated together depth-wise.
      • For example:
        • Input for inception module is 28 x 28 x 256
        • Then the parallel filters applied:
          • (1 x 1), 128 filter # output shape (28,28,128)
          • (3 x 3), 192 filter # output shape (28,28,192)
          • (5 x 5), 96 filter # output shape (28,28,96)
          • (3 x 3) Max pooling # output shape (28,28,256)
        • After concatenation this will be (28,28,672)
      • This design (call it the "naive" Inception module) has a big computational cost:
        • The last example will make:
          • [1 x 1 conv, 128] ==> 28 * 28 * 128 * 1 * 1 * 256 = 25 Million approx
          • [3 x 3 conv, 192] ==> 28 * 28 * 192 *3 *3 * 256 = 346 Million approx
          • [5 x 5 conv, 96] ==> 28 * 28 * 96 * 5 * 5 * 256 = 482 Million approx
          • In total around 854 Million operation!
      • Solution: bottleneck layers that use 1x1 convolutions to reduce feature depth.
      • The bottleneck solution will make a total operations of 358M on this example which is good compared with the naive implementation.
    • So GoogLeNet stacks this Inception module multiple times to build a full network that solves the problem without fully connected layers.
    • Just to mention, it uses an average pooling layer at the end, before the classification step.
    • Full architecture:
    • In February 2015, Batch-Normalized Inception was introduced as Inception V2. Batch normalization computes the mean and standard deviation of all feature maps at the output of a layer, and normalizes their responses with these values.
    • In December 2015 the paper "Rethinking the Inception Architecture for Computer Vision" was introduced, which explains the older Inception models well and introduces a new version, V3.
  • The first GoogLeNet and VGG predate the invention of batch normalization, so they needed some hacks to train the NN and make it converge.
  • ResNet (2015) (Microsoft Research)
    • A 152-layer model for ImageNet. Won with 3.57% error, which beats human-level performance.
    • This was also the very first time that networks of over a hundred, even up to a thousand, layers were trained.
    • It swept all classification and detection competitions in ILSVRC'15 and COCO'15!
    • What happens when we keep stacking deeper layers on a "plain" convolutional neural network?
      • The deeper model performs worse, but it's not caused by overfitting!
      • Learning stalls because deeper plain networks are harder to optimize!
    • The deeper model should be able to perform at least as well as the shallower model.
    • A solution by construction is to copy the learned layers from the shallower model and set the additional layers to identity mappings.
    • Residual block:
      • Microsoft came up with the residual block, which has this architecture:

    ```python
    # Instead of trying to learn a whole new representation, we learn only the residual:
    Y = (W2 * RELU(W1*x + b1) + b2) + x      # i.e. F(x) + x
    ```



    • Say you have a network of depth N layers. You only want to add a new layer if you get something extra out of adding that layer.
    • One way to ensure the new (N+1)-th layer learns something new is to also provide the input x, without any transformation, to the output of the (N+1)-th layer. This essentially drives the new layer to learn something different from what the input has already encoded. See the numpy sketch below.
    • The other advantage is that such connections help in handling the vanishing gradient problem in very deep networks.
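A minimal numpy sketch of a residual block's forward pass (the weight names are illustrative, not from the paper):

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def residual_block(x, W1, b1, W2, b2):
    F = W2.dot(relu(W1.dot(x) + b1)) + b2  # the residual F(x) that is learned
    return F + x                           # skip connection: output = F(x) + x
```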
  • With the residual block we can now have a deep NN of any depth without fearing that we can't optimize the network.
  • ResNets with a large number of layers started to use a bottleneck layer, similar to the Inception bottleneck, to reduce the dimensions.
  • Full ResNet architecture:
    • Stack residual blocks.
    • Every residual block has two 3 x 3 conv layers.
    • Additional conv layer at the beginning.
    • No FC layers at the end (only FC 1000 to output classes)
    • Periodically, double number of filters and downsample spatially using stride 2 (/2 in each dimension)
    • Training ResNet in practice:
      • Batch Normalization after every CONV layer.
      • Xavier/2 initialization from He et al.
      • SGD + Momentum (0.9)
      • Learning rate: 0.1, divided by 10 when validation error plateaus
      • Mini-batch size 256
      • Weight decay of 1e-5
      • No dropout used.
  • Inception-v4 (ResNet + Inception) was introduced in 2016.
  • Comparing complexity across all the architectures:
    • VGG: highest memory, most operations.
    • GoogLeNet: most efficient.
  • ResNets Improvements:
    • (2016) Identity Mappings in Deep Residual Networks
      • From the creators of ResNet.
      • Gives better performance.
    • (2016) Wide Residual Networks
      • Argues that residuals are the important factor, not depth
      • 50-layer wide ResNet outperforms 152-layer original ResNet
      • Increasing width instead of depth more computationally efficient (parallelizable)
    • (2016) Deep Networks with Stochastic Depth
      • Motivation: reduce vanishing gradients and training time through short networks during training.
      • Randomly drop a subset of layers during each training pass
      • Use full deep network at test time.
  • Beyond ResNets:
    • (2017) FractalNet: Ultra-Deep Neural Networks without Residuals
      • Argues that key is transitioning effectively from shallow to deep and residual representations are not necessary.
      • Trained with dropping out sub-paths
      • Full network at test time.
    • (2017) Densely Connected Convolutional Networks
    • (2017) SqueezeNet: AlexNet-level Accuracy With 50x Fewer Parameters and <0.5Mb Model Size
      • Good for production.
      • It is a re-hash of many concepts from ResNet and Inception, and shows that, after all, a better architecture design can deliver small network sizes and parameter counts without needing complex compression algorithms.
  • Conclusion:
    • ResNet is the current best default.
    • There is a trend towards extremely deep networks.
    • In the last couple of years, many models have used ResNet-style shortcut connections to let the gradients flow easily.

10. Recurrent Neural networks

  • Vanilla neural networks ("feedforward neural networks") take an input of fixed size, pass it through some hidden units, and produce an output. We call this a one-to-one network.
  • Recurrent Neural Networks RNN Models:
    • One to many
      • Example: Image Captioning
        • image ==> sequence of words
    • Many to One
      • Example: Sentiment Classification
        • sequence of words ==> sentiment
    • Many to many
      • Example: Machine Translation
        • seq of words in one language ==> seq of words in another language
      • Example: Video classification on frame level
  • RNNs can also work for Non-Sequence Data (One to One problems)
  • So what is a recurrent neural network?
    • A recurrent core cell takes an input x, and the cell has an internal state that is updated each time it reads an input.

    • The RNN block should return a vector.

    • We can process a sequence of vectors x by applying a recurrence formula at every time step:

    ```python
    h[t] = fw(h[t-1], x[t])        # Where fw is some function with parameters W
    ```

    • The same function and the same set of parameters are used at every time step.
  • (Vanilla) Recurrent Neural Network:

    ```
    h[t] = tanh(W[h,h]*h[t-1] + W[x,h]*x[t])    # Then we save h[t]
    y[t] = W[h,y]*h[t]
    ```

    • This is the simplest example of an RNN; a numpy sketch of one step follows.
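A minimal numpy sketch of one vanilla RNN step as written above (the weight names are illustrative):

```python
import numpy as np

def rnn_step(x, h_prev, Wxh, Whh, Why):
    h = np.tanh(Whh.dot(h_prev) + Wxh.dot(x))  # new hidden state h[t]
    y = Why.dot(h)                             # output y[t] at this time step
    return h, y
```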
  • RNN works on a sequence of related data.
  • Recurrent NN Computational graph:
    • h0 is initialized to zero.
    • The gradient of W is the sum of all the W gradients that have been calculated over the time steps!
    • A many to many graph:
      • The total loss is the sum of the per-step losses, and since the weights producing Y are shared across steps, they are updated by summing all the per-step gradients!
    • A many to one graph:
    • A one to many graph:
    • sequence to sequence graph:
      • Encoder and decoder philosophy.
  • Examples:
    • Suppose we are building words using characters. We want a model to predict the next character of a sequence. Let's say the characters are only [h, e, l, o] and the training word is "hello".
      • Training:
        • Only the third prediction here is true. The loss needs to be optimized.
        • We can train the network by feeding the whole word(s).
      • Test time:
        • At test time we work character by character: the sampled output character becomes the next input, along with the saved hidden activations.
        • This link contains all the code, but it uses truncated backpropagation through time, as we will discuss.
  • Backpropagation through time: run forward through the entire sequence to compute the loss, then backward through the entire sequence to compute the gradient.
    • But if we use the whole sequence, it is slow, takes a lot of memory, and converges slowly!
  • So in practice people use "truncated backpropagation through time": as we go, we run forward and backward through chunks of the sequence instead of the whole sequence.
    • We carry the hidden states forward in time forever, but only backpropagate through some smaller number of steps.
  • Example on image captioning:
    • A special token is used to signal the end of caption generation.
    • The biggest dataset for image captioning is Microsoft COCO.
  • In image captioning with attention, the RNN looks at a specific part of the image, not the whole image, while generating each word of the caption.
    • The attention technique is also used in the "Visual Question Answering" problem.
  • Multilayer RNNs stack recurrent layers, feeding each layer's hidden states into the next layer. LSTMs are commonly used in such multilayer RNNs.
  • The backward flow of gradients in an RNN can explode or vanish. Exploding is controlled with gradient clipping; vanishing is controlled with additive interactions (LSTM).
  • LSTM stands for Long Short Term Memory. It was designed to help with the vanishing gradient problem in RNNs.
    • It consists of four gates:
      • f: Forget gate, whether to erase the cell
      • i: Input gate, whether to write to the cell
      • g: Gate gate (?), how much to write to the cell
      • o: Output gate, how much to reveal the cell
    • LSTM gradients flow easily along the cell state, similar to ResNet's skip connections.
    • The LSTM keeps information in long- or short-term memory as it trains, meaning it can remember things not just from the previous step but from many steps back.
  • Highway networks are something between ResNet and LSTM, and are still being researched.
  • Better/simpler architectures are a hot topic of current research
  • Better understanding (both theoretical and empirical) is needed.
  • RNNs are used for problems involving sequences of related inputs, like NLP and speech recognition.

11. Detection and Segmentation

  • So far we have talked about the image classification problem. In this section we will talk about segmentation, localization, and detection.
  • Semantic Segmentation
    • We want to label each pixel in the image with a category label.
    • As with the cows in the example image, semantic segmentation doesn't differentiate instances; it only cares about pixels.
    • The first idea is to use a sliding window: take a small window and slide it all over the picture, labeling the center pixel of each window.
      • It would work, but it's not a good idea because it's computationally expensive!
      • Very inefficient! It doesn't reuse shared features between overlapping patches.
      • In practice nobody uses this.
    • The second idea is to design the network as a stack of convolutional layers and make predictions for all pixels at once!
      • The input is the whole image; the output is the image with each pixel labeled.
      • We need a lot of labeled data, and such data is very expensive to produce.
      • It needs deep conv layers.
      • The loss is the cross-entropy over each pixel.
      • Data augmentation helps here.
      • The problem with this design is that convolutions at the original image resolution are very expensive.
      • So in practice you don't see networks like this right now.
    • The third idea builds on the last one. The difference is that we downsample and then upsample inside the network.
      • We downsample because convolving the whole image at full resolution is very expensive, so we downsample through multiple layers and then upsample at the end.
      • Downsampling uses operations like pooling and strided convolution.
      • Upsampling uses operations like "nearest neighbor", "bed of nails", or "max unpooling":
        • Nearest Neighbor example:
         Input:   1  2               Output:   1  1  2  2
                  3  4                         1  1  2  2
                                               3  3  4  4
                                               3  3  4  4
        • Bed of Nails example:
         Input:   1  2               Output:   1  0  2  0
                  3  4                         0  0  0  0
                                               3  0  4  0
                                               0  0  0  0
        • Max unpooling depends on the positions recorded during an earlier max pooling step: each value is placed back where its max came from, and the other pixels are filled with zeros.
      • Max unpooling seems to be the best of these fixed upsampling ideas.
      • There is also an idea for learnable upsampling called “Transpose Convolution”:
        • Rather than performing a convolution, we perform its reverse.
        • Also called:
          • Upconvolution
          • Fractionally strided convolution
          • Backward strided convolution
        • To learn the arithmetic of upsampling, refer to chapter 4 of this paper: https://arxiv.org/abs/1603.07285
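      • A minimal NumPy sketch of the “Nearest Neighbor” upsampling shown above:

```python
import numpy as np

x = np.array([[1, 2],
              [3, 4]])
up = x.repeat(2, axis=0).repeat(2, axis=1)  # duplicate each pixel into a 2x2 block
print(up)
# [[1 1 2 2]
#  [1 1 2 2]
#  [3 3 4 4]
#  [3 3 4 4]]
```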
  • Classification + Localization:
    • In this problem we want to classify the main object in the image and its location as a rectangle.
    • We assume there is one object in the image.
    • We will create a multi-task NN. The architecture is as follows:
      • Convolution network layers connected to:
        • FC layers that classify the object. # The plain classification problem we know
        • FC layers that connects to a four numbers (x,y,w,h)
          • We treat Localization as a regression problem.
    • This problem will have two losses:
      • Softmax loss for classification
      • Regression loss for the localization (L2 loss)
    • Loss = SoftmaxLoss + L2 loss
    • Often the first Conv layers come from a pretrained network like AlexNet!
    • This technique can be used in many other problems, like human pose estimation. A sketch of the combined loss follows.
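    • A hedged PyTorch sketch of the multi-task loss (the head outputs, targets, and the weighting factor w are assumptions for illustration):

```python
import torch.nn.functional as F

def classification_localization_loss(class_scores, box_pred, labels, box_target, w=1.0):
    softmax_loss = F.cross_entropy(class_scores, labels)  # classification head
    l2_loss = F.mse_loss(box_pred, box_target)            # (x, y, w, h) regression head
    return softmax_loss + w * l2_loss                     # Loss = SoftmaxLoss + L2 loss
```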
  • Object Detection
    • A core problem in computer vision; we will discuss it in detail.
    • The difference from “Classification + Localization” is that here we want to detect one or more objects and their locations!
    • First idea is to use a sliding window
      • It worked well for a long time.
      • The steps are:
        • Apply a CNN to many different crops of the image, CNN classifies each crop as object or background.
      • The problem is that we need to apply the CNN to a huge number of locations and scales, which is very computationally expensive!
      • A brute-force sliding window would take a huge amount of time.
    • Region Proposals help us decide which regions to run our network on:
      • Find blobby image regions that are likely to contain objects.
      • Relatively fast to run; e.g. Selective Search gives 1000 region proposals in a few seconds on CPU
    • So now we can apply a region proposal method and then apply the first idea to the proposed regions.
    • There is another idea called R-CNN:
      • It takes region proposals of different sizes from the image, warps them all to one fixed size, and feeds each one through a CNN. The warping loses information, and every proposal is processed separately.
      • It is also very slow.
    • Fast R-CNN is another idea that builds on R-CNN:
      • It runs a single CNN over the whole image and pools features for each proposal, instead of running the CNN on every crop.
    • Faster R-CNN does its own region proposals by Inserting Region Proposal Network (RPN) to predict proposals from features.
      • The fastest of the R-CNNs.
    • Another idea is Detection without Proposals: YOLO / SSD
      • YOLO stands for “you only look once”.
      • YOLO and SSD are two separate algorithms.
      • They are faster but not as accurate.
    • Takeaways
      • Faster R-CNN is slower but more accurate.
      • SSD/YOLO is much faster but not as accurate.
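    • Not covered explicitly in these notes, but detectors are usually scored by box overlap; a small hypothetical helper for intersection-over-union (IoU) between boxes given as (x1, y1, x2, y2):

```python
def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])   # top-left of the intersection
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])   # bottom-right of the intersection
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter)
```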
  • Dense Captioning
    • Dense Captioning is “Object Detection + Captioning”.
    • Paper that covers this idea can be found here.
  • Instance Segmentation
    • This is like the full problem.
    • Rather than only predicting bounding boxes, we want to label each pixel and also distinguish between the instances.
    • There are a lot of ideas.
    • There is a newer idea, “Mask R-CNN”:
      • Like Faster R-CNN, but inside it we also run semantic segmentation on each region.
      • There are a lot of good results from this paper.
      • It combines all the things we have discussed in this lecture.
      • Performance of this seems good.

12. Visualizing and Understanding

  • We want to know what’s going on inside ConvNets.
  • People want to be able to trust the black box (CNN): to know exactly how it works and why it makes good decisions.
  • A first approach is to visualize filters of the first layer.
    • Maybe the shape of the first layer filter is 5 x 5 x 3, and the number of filters are 16. Then we will have 16 different “colored” filter images.
    • It turns out that these filters learn primitive shapes and oriented edges, much as the human brain does.
    • These filters look roughly the same for every ConvNet you train, e.g. AlexNet, VGG, GoogleNet, or ResNet.
    • This tells you what the first convolution layer is looking for in the image.
  • We can visualize filters from later layers, but they don’t tell us much.
    • For example, a second-layer filter might have shape 5 x 5 x 20 with 16 filters; we would then get 16*20 different “gray” filter images.
  • AlexNet has some FC layers at the end. Take the 4096-dimensional feature vector it computes for each image, and collect these feature vectors.
    • If we run nearest neighbors on these feature vectors and look at the corresponding images, the matches are far better than running KNN on the raw images directly!
    • This similarity tells us that these CNNs really capture the semantic meaning of the images instead of working at the pixel level!
    • We can make a dimensionality reduction on the 4096 dimensional feature and compress it to 2 dimensions.
      • This can be done with PCA or t-SNE.
      • t-SNE is used more with deep learning to visualize data. An example can be found here, and a sketch follows.
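      • A hedged scikit-learn sketch of t-SNE on collected feature vectors (the random array stands in for real 4096-d CNN features):

```python
import numpy as np
from sklearn.manifold import TSNE

features = np.random.randn(500, 4096)                  # stand-in CNN feature vectors
coords = TSNE(n_components=2).fit_transform(features)  # (500, 2) embedding to plot
```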
  • We can Visualize the activation maps.
    • For example, if the CONV5 feature map is 128 x 13 x 13, we can visualize it as 128 gray-scale images of size 13 x 13.
    • Some of these feature maps activate for particular inputs, so we learn that a particular map is looking for something specific.
    • This was done by Yosinski et al. More info can be found here.
  • There are something called Maximally Activating Patches that can help us visualize the intermediate features in Convnets
    • The steps of doing this is as following:
      • We choose a layer then a neuron
        • Ex. We choose Conv5 in AlexNet which is 128 x 13 x 13 then pick channel (Neuron) 17/128
      • Run many images through the network, record values of chosen channel.
      • Visualize image patches that correspond to maximal activations.
        • We will find that each neuron looks at a specific part of the image.
        • The patches are extracted using the receptive field of the chosen neuron.
  • Another idea is Occlusion Experiments
    • We mask part of the image before feeding it to the CNN, and draw a heat map of the true-class probability at each mask location.
    • It shows the parts of the image that the network’s decision depends on most; a sketch follows.
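    • A hedged PyTorch sketch of an occlusion experiment (the tiny linear classifier is a stand-in for a trained CNN):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 10))  # stand-in classifier

def occlusion_heatmap(model, image, label, patch=16, stride=8):
    _, _, H, W = image.shape
    rows, cols = (H - patch) // stride + 1, (W - patch) // stride + 1
    heat = torch.zeros(rows, cols)
    for i in range(rows):
        for j in range(cols):
            masked = image.clone()
            y, x = i * stride, j * stride
            masked[:, :, y:y+patch, x:x+patch] = 0.5       # gray occluder
            with torch.no_grad():
                heat[i, j] = F.softmax(model(masked), dim=1)[0, label]
    return heat                                            # low prob = important region

heat = occlusion_heatmap(model, torch.rand(1, 3, 64, 64), label=3)
```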
  • Saliency Maps tells which pixels matter for classification
    • Like occlusion experiments, but with a completely different approach.
    • We compute the gradient of the (unnormalized) class score with respect to the image pixels, take the absolute value, and max over the RGB channels. This gives a gray image showing the most important areas of the input; a sketch follows.
    • This can sometimes be used for semantic segmentation.
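    • A minimal PyTorch sketch of a saliency map (the linear model is a stand-in for a trained CNN):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 10))  # stand-in classifier

def saliency_map(model, image, label):
    image = image.clone().requires_grad_(True)  # (1, 3, H, W)
    score = model(image)[0, label]              # unnormalized class score
    score.backward()                            # gradient w.r.t. the image pixels
    return image.grad.abs().max(dim=1)[0]       # abs value, max over RGB -> (1, H, W)

sal = saliency_map(model, torch.rand(1, 3, 64, 64), label=3)
```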
  • (Guided) backprop does something like Maximally Activating Patches, but unlike it, it shows which pixels within the patch we actually care about.
    • In this technique we choose a channel, as in Maximally Activating Patches, and then compute the gradient of the neuron value with respect to the image pixels.
    • Images come out nicer if you only backprop positive gradients through each ReLU (guided backprop).
  • Gradient Ascent
    • Generate a synthetic image that maximally activates a neuron.

    • The reverse of gradient descent: instead of taking the minimum, it takes the maximum.

    • We want to find the input image that maximizes the neuron’s activation, so instead of learning weights we learn the image:

* ```python
  # f(I) is the neuron value and R(I) is a natural image regularizer.
  # Objective: I* = argmax over I of ( f(I) + R(I) )
  ```
  • Steps of gradient ascent
    • Initialize image to zeros.
    • Forward image to compute current scores.
    • Backprop to get gradient of neuron value with respect to image pixels.
    • Make a small update to the image
  • A simple choice of R(I) penalizes the L2 norm of the generated image.
  • To get a better results we use a better regularizer:
    • penalize L2 norm of image; also during optimization periodically:
      • Gaussian blur image
      • Clip pixels with small values to 0
      • Clip pixels with small gradients to 0
  • A better regularizer makes our generated images cleaner! A sketch of the loop follows.
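  • A hedged PyTorch sketch of the gradient ascent loop with the simple L2 regularizer (the stand-in model, chosen neuron, learning rate, and penalty weight are all assumptions):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 10))  # stand-in network
neuron = 3                                                       # chosen neuron/class

img = torch.zeros(1, 3, 64, 64, requires_grad=True)  # 1. initialize image to zeros
opt = torch.optim.SGD([img], lr=1.0)
for _ in range(200):
    score = model(img)[0, neuron]              # 2. forward: current neuron value
    obj = score - 1e-3 * img.pow(2).sum()      # maximize f(I) with an L2 penalty R(I)
    opt.zero_grad()
    (-obj).backward()                          # 3. gradient ascent via a negated loss
    opt.step()                                 # 4. small update to the image
```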
  • The results from later layers seem to mean more than those from earlier layers.
  • We can fool CNN by using this procedure:
    • Start from an arbitrary image. # Random picture based on nothing.
    • Pick an arbitrary class. # Random class
    • Modify the image to maximize the class.
    • Repeat until network is fooled.
  • The results of fooling a network are pretty surprising!
    • To human eyes the two images look the same, but the network is fooled by just some added noise!
  • DeepDream: Amplify existing features
    • Google released deep dream on their website.
    • What it actually does is the same procedure as fooling the NN that we discussed, but rather than synthesizing an image to maximize a specific neuron, it tries to amplify the neuron activations at some layer in the network.
    • Steps:
      • Forward: compute activations at the chosen layer. # from any input image
      • Set the gradient of the chosen layer equal to its activation.
        • Equivalent to I* = argmax over I of sum(f(I)^2)
      • Backward: Compute gradient on image.
      • Update image.
    • The code of DeepDream is online; you can download and check it yourself. A one-step sketch follows.
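    • A hedged PyTorch sketch of one DeepDream step (the tiny conv stack is a stand-in for a trained network’s chosen layer):

```python
import torch
import torch.nn as nn

layers = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                       nn.Conv2d(8, 8, 3, padding=1))  # stand-in "chosen layer"
img = torch.rand(1, 3, 64, 64, requires_grad=True)     # any input image

act = layers(img)                      # forward: activations at the chosen layer
act.backward(gradient=act.detach())    # set the layer's gradient to its activation
with torch.no_grad():                  # backward gave us the gradient on the image
    img += 0.01 * img.grad / (img.grad.abs().mean() + 1e-8)  # update the image
```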
  • Feature Inversion
    • Tells us what kinds of elements of the image are captured at different layers of the network.
    • Given a CNN feature vector for an image, find a new image that:
      • Matches the given feature vector.
      • looks natural (image prior regularization)
  • Texture Synthesis
    • Old problem in computer graphics.
    • Given a sample patch of some texture, can we generate a bigger image of the same texture?
    • There is an algorithm which doesn’t depend on NN:
      • Wei and Levoy, Fast Texture Synthesis using Tree-structured Vector Quantization, SIGGRAPH 2000
      • It’s a really simple algorithm.
    • The point is that this is an old problem with many existing algorithms, but the simple ones don’t work well on complex textures!
    • An idea using NNs, based on gradient ascent, was proposed in 2015 and called “Neural Texture Synthesis”:
      • It depends on something called the Gram matrix; a sketch follows.
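      • A hedged PyTorch sketch of a Gram matrix computed from one conv feature map:

```python
import torch

def gram_matrix(features):
    """features: (C, H, W) feature map -> (C, C) Gram matrix."""
    C, H, W = features.shape
    F = features.reshape(C, H * W)
    return (F @ F.t()) / (H * W)  # channel co-occurrence statistics (texture, no layout)
```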
  • Neural Style Transfer = Feature + Gram Reconstruction
    • Gatys, Ecker, and Bethge, Image style transfer using Convolutional neural networks, CVPR 2016
    • An implementation in PyTorch can be found here.
  • Style transfer requires many forward / backward passes through VGG; very slow!
    • Train another neural network to perform style transfer for us!
    • Fast Style Transfer is the solution.
    • Johnson, Alahi, and Fei-Fei, Perceptual Losses for Real-Time Style Transfer and Super-Resolution, ECCV 2016
    • https://github.com/jcjohnson/fast-neural-style
  • There is a lot of work on style transfer, and it continues to this day!
  • Summary:
    • Activations: Nearest neighbors, Dimensionality reduction, maximal patches, occlusion
    • Gradients: Saliency maps, class visualization, fooling images, feature inversion
    • Fun: DeepDream, Style Transfer

13. Generative models

  • Generative models are a type of unsupervised learning.

  • Supervised vs Unsupervised Learning:

    | | Supervised Learning | Unsupervised Learning |
    | --- | --- | --- |
    | Data structure | Data: (x, y), where x is data and y is label | Data: x. Just data, no labels! |
    | Data price | Training data is expensive in a lot of cases. | Training data is cheap! |
    | Goal | Learn a function to map x -> y | Learn some underlying hidden structure of the data |
    | Examples | Classification, regression, object detection, semantic segmentation, image captioning | Clustering, dimensionality reduction, feature learning, density estimation |
  • Autoencoders are a Feature learning technique.

    • It contains an encoder and a decoder: the encoder downsamples the image while the decoder upsamples the features.
    • The loss is an L2 reconstruction loss.
  • Density estimation is where we want to learn/estimate the underlying distribution of the data!

  • Compared with supervised learning, unsupervised learning is full of open research problems!

  • Generative Models

    • Given training data, generate new samples from same distribution.
    • Addresses density estimation, a core problem in unsupervised learning.
    • We have different ways to do this:
      • Explicit density estimation: explicitly define and solve for the model’s density.
      • Implicit density estimation: learn a model that can sample from the density without explicitly defining it.
    • Why Generative Models?
      • Realistic samples for artwork, super-resolution, colorization, etc
      • Generative models of time-series data can be used for simulation and planning (reinforcement learning applications!)
      • Training generative models can also enable inference of latent representations that can be useful as general features
    • The taxonomy of generative models splits them into explicit-density and implicit-density models.
    • In this lecture we will discuss PixelRNN/CNN, Variational Autoencoders, and GANs, as they are the popular models in research now.
  • PixelRNN and PixelCNN

    • In a fully visible belief network we use the chain rule to decompose the likelihood of an image x into a product of 1-d distributions:
      • p(x) = prod over i of p(x[i] | x[1], x[2], ..., x[i-1])
      • Where p(x) is the likelihood of image x and p(x[i] | ...) is the probability of the i’th pixel value given all previous pixels.
    • To train, we maximize the likelihood of the training data, but this distribution over pixel values is very complex.
    • We also need to define an ordering of the “previous” pixels.
    • PixelRNN
      • Proposed by [van den Oord et al. 2016].
      • Dependency on previous pixels modeled using an RNN (LSTM)
      • Generate image pixels starting from corner
      • Drawback: sequential generation is slow! because you have to generate pixel by pixel!
    • PixelCNN
      • Also proposed by [van den Oord et al. 2016].
      • Still generate image pixels starting from corner.
      • Dependency on previous pixels now modeled using a CNN over context region
      • Training is faster than PixelRNN (can parallelize convolutions since context region values known from training images)
      • Generation must still proceed sequentially, so it is still slow.
    • There are some tricks to improve PixelRNN & PixelCNN.
    • PixelRNN and PixelCNN can generate good samples and are still an active area of research.
  • Autoencoders

    • Unsupervised approach for learning a lower-dimensional feature representation from unlabeled training data.
    • Consists of Encoder and decoder.
    • The encoder:
      • Converts the input x to the features z. z should be smaller than x to get only the important values out of the input. We can call this dimensionality reduction.
      • The encoder can be made with:
        • Linear or nonlinear layers (early days)
        • Deep fully connected NN (Then)
        • RELU CNN (Currently we use this on images)
    • The decoder:
      • We want the decoder to map the features z back to an output similar to (or the same as) the input x.
      • The decoder can be built with the same techniques as the encoder, and currently it uses a ReLU CNN.
    • The encoder is made of conv layers while the decoder is made of deconv (transpose conv) layers: resolution decreases and then increases again.
    • The loss function is L2 loss function:
      • L = |x - x'|^2, where x' is the reconstruction.
        • After training we throw away the decoder. # Now we have the features we need
    • We can use this encoder we have to make a supervised model.
      • The value of this is that it learns a good feature representation of your inputs.
      • A lot of times we only have a small labeled dataset for a problem. One way to tackle this is to train an autoencoder on unlabeled images, then train on your small dataset on top of the learned encoder.
    • The question is: can we generate data (images) from this autoencoder?
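    • Before answering that, a hedged PyTorch sketch of the plain autoencoder described above (the layer sizes are assumptions):

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                        nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU())
decoder = nn.Sequential(nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
                        nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1))

x = torch.rand(8, 3, 32, 32)       # unlabeled images
z = encoder(x)                     # low-dimensional features
x_hat = decoder(z)                 # reconstruction
loss = ((x - x_hat) ** 2).mean()   # L2 reconstruction loss
# After training, throw away the decoder and reuse the encoder's features.
```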
  • Variational Autoencoders (VAE)

    • Probabilistic spin on Autoencoders - will let us sample from the model to generate data!
    • We have z as the latent feature vector, produced by the encoder.
    • We then choose prior p(z) to be simple, e.g. Gaussian.
      • Reasonable for hidden attributes: e.g. pose, how much smile.
    • Conditional p(x|z) is complex (generates image) => represent with neural network
    • But we can’t compute the integral p(x) = Integral over z of p(z) * p(x|z) dz; it is intractable.
    • Working around this intractability, we instead maximize a tractable variational lower bound (ELBO): log p(x) >= E[log p(x|z)] - KL(q(z|x) || p(z)), where q(z|x) is an encoder network approximating the true posterior.
    • Variational autoencoders are a principled approach to generative models, but their samples are blurrier and lower quality compared to the state of the art (GANs).
    • Active areas of research:
      • More flexible approximations, e.g. richer approximate posterior instead of diagonal Gaussian
      • Incorporating structure in latent variables
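    • A hedged PyTorch sketch of the resulting VAE loss (reconstruction plus a KL term for a diagonal Gaussian posterior; the encoder/decoder producing these tensors are assumed):

```python
import torch

def vae_loss(x, x_hat, mu, logvar):
    recon = ((x - x_hat) ** 2).sum()                           # reconstruction term
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum()  # KL(q(z|x) || N(0, I))
    return recon + kl                                          # negative ELBO
```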
  • Generative Adversarial Networks (GANs)

    • GANs don’t work with any explicit density function!
    • Instead, take game-theoretic approach: learn to generate from training distribution through 2-player game.
    • Yann LeCun, who oversees AI research at Facebook, has called GANs:
      • The coolest idea in deep learning in the last 20 years

    • Problem: Want to sample from complex, high-dimensional training distribution. No direct way to do this as we have discussed!
    • Solution: Sample from a simple distribution, e.g. random noise. Learn transformation to training distribution.
    • So we create noise drawn from a simple distribution and feed it to a NN, which we call the generator network, that learns to transform it into the distribution we want.
    • Training GANs: Two-player game:
      • Generator network: try to fool the discriminator by generating real-looking images.
      • Discriminator network: try to distinguish between real and fake images.
    • If we are able to train the discriminator well, then we can train the generator to generate the right images.
    • The GAN loss is a minimax game: min over G of max over D of [ E_x log D(x) + E_z log(1 - D(G(z))) ].
    • The label of images from the generator network is 0, and of real images is 1.
    • To train the network we do:
      • Gradient ascent on the discriminator objective.
      • Gradient ascent on the generator too, but with a different loss: maximize log D(G(z)) instead of minimizing log(1 - D(G(z))), which gives stronger gradients early in training.
    • The full algorithm with the equations is in the lecture slides; a one-step sketch follows this list.
    • Aside: Jointly training two networks is challenging, can be unstable. Choosing objectives with better loss landscapes helps training is an active area of research.
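    • A hedged PyTorch sketch of one step of the two-player game (the tiny G and D are stand-ins; D is assumed to end in a sigmoid, and the non-saturating generator loss is used):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

G = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 128), nn.LeakyReLU(0.2), nn.Linear(128, 1), nn.Sigmoid())
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)

real = torch.rand(32, 784)                        # stand-in batch of real images
ones, zeros = torch.ones(32, 1), torch.zeros(32, 1)

fake = G(torch.randn(32, 64))
# Discriminator step: real images labeled 1, generated images labeled 0.
d_loss = F.binary_cross_entropy(D(real), ones) + \
         F.binary_cross_entropy(D(fake.detach()), zeros)
opt_D.zero_grad(); d_loss.backward(); opt_D.step()

# Generator step: maximize log D(G(z)) (ascent via minimizing -log D(G(z))).
g_loss = F.binary_cross_entropy(D(fake), ones)
opt_G.zero_grad(); g_loss.backward(); opt_G.step()
```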
    • Convolutional Architectures:
      • The generator is an upsampling network with fractionally-strided convolutions; the discriminator is a convolutional network.
      • Guidelines for stable deep Conv GANs:
        • Replace any pooling layers with strided convolutions (discriminator) and fractionally-strided convolutions (generator).
        • Use batch norm for both networks.
        • Remove fully connected hidden layers for deeper architectures.
        • Use ReLU activation in the generator for all layers except the output, which uses tanh.
        • Use leaky ReLU in the discriminator for all layers.
    • 2017 is the year of the GANs! They have exploded, and there are some really good results.
    • GANs for all kinds of applications are also an active area of research.
    • The GAN zoo can be found here: https://github.com/hindupuravinash/the-gan-zoo
    • Tips and tricks for using GANs: https://github.com/soumith/ganhacks
    • NIPS 2016 Tutorial GANs: https://www.youtube.com/watch?v=AJVyzd0rqdc

14. Deep reinforcement learning

  • This section contains a lot of math.
  • Reinforcement learning problems involve an agent interacting with an environment, which provides numeric reward signals.
  • Steps are:
    • Environment –> State s[t] –> Agent –> Action a[t] –> Environment –> Reward r[t] + Next state s[t+1] –> Agent –> and so on..
  • Our goal is to learn how to take actions that maximize reward.
  • An example is Robot Locomotion:
    • Objective: Make the robot move forward
    • State: Angle and position of the joints
    • Action: Torques applied on joints
    • Reward: 1 at each time step the robot is upright, plus reward for forward movement
  • Another example is Atari Games:
    • Deep learning achieves state-of-the-art results on this problem.
    • Objective: Complete the game with the highest score.
    • State: Raw pixel inputs of the game state.
    • Action: Game controls e.g. Left, Right, Up, Down
    • Reward: Score increase/decrease at each time step
  • The game of Go is another example; AlphaGo’s win last year (2016) was a big achievement for AI and deep learning, because the problem is so hard.
  • We can mathematically formulate the RL (reinforcement learning) by using Markov Decision Process
  • Markov Decision Process
    • Defined by (S, A, R, P, Y) where:
      • S: set of possible states.
      • A: set of possible actions
      • R: distribution of reward given (state, action) pair
      • P: transition probability i.e. distribution over next state given (state, action) pair
      • Y: discount factor # how much we value rewards coming up soon versus later on
    • Algorithm:
      • At time step t=0, environment samples initial state s[0]
      • Then, for t=0 until done:
        • Agent selects action a[t]
        • Environment samples reward from R with (s[t], a[t])
        • Environment samples next state from P with (s[t], a[t])
        • Agent receives reward r[t] and next state s[t+1]
    • A policy pi is a function from S to A that specifies what action to take in each state.
    • Objective: find policy pi* that maximizes cumulative discounted reward: Sum(Y^t * r[t], t>0)
    • The lecture works through a small grid-world example and the optimal policy that solves it.
  • The value function at state s is the expected cumulative reward from following the policy starting from state s:
    • V[pi](s) = E[ Sum(Y^t * r[t], t >= 0) | s[0] = s, pi ]
  • The Q-value function at state s and action a is the expected cumulative reward from taking action a in state s and then following the policy:
    • Q[pi](s,a) = E[ Sum(Y^t * r[t], t >= 0) | s[0] = s, a[0] = a, pi ]
  • The optimal Q-value function Q* is the maximum expected cumulative reward achievable from a given (state, action) pair:
    • Q*(s,a) = max over pi of E[ Sum(Y^t * r[t], t >= 0) | s[0] = s, a[0] = a, pi ]
  • Bellman equation
    • It is an important identity in RL.
    • Given any state-action pair (s,a), its value is the reward r you get plus the discounted value of the best state you end up in:
      • Q*(s,a) = E[ r + Y * max over a' of Q*(s',a') | s,a ] # note there is no policy in the equation
    • The optimal policy pi* corresponds to taking the best action in any state as specified by Q*
  • We can get the optimal policy using the value iteration algorithm that uses the Bellman equation as an iterative update
  • Due to the huge state spaces of real-world applications, we use a function approximator to estimate Q(s,a), e.g. a neural network! This is called Q-learning.
    • Any time we have a complex function that we cannot represent directly, we use a neural network!
  • Q-learning
    • The first of the two deep RL algorithms covered here.
    • Use a function approximator to estimate the action-value function
    • If the function approximator is a deep neural network => deep q-learning
    • The loss function: L = E[ (y - Q(s,a;theta))^2 ], where y = r + Y * max over a' of Q(s',a';theta) is the Bellman target.
  • Now let’s consider the “Playing Atari Games” problem:
    • The total reward is usually the score shown at the top of the screen.
    • Q-network architecture: the input is a stack of the last 4 frames, passed through conv layers and then FC layers, with one output per action.
    • Learning from batches of consecutive samples is a problem: consecutive samples are strongly correlated, which makes learning inefficient. So we use “experience replay” instead of consecutive samples, letting the network try the game again and again until it masters it.
    • Continually update a replay memory table of transitions (s[t] , a[t] , r[t] , s[t+1]) as game (experience) episodes are played.
    • Train Q-network on random minibatches of transitions from the replay memory, instead of consecutive samples.
    • The full algorithm:
    • A video demonstrating the algorithm on an Atari game can be found here: https://www.youtube.com/watch?v=V1eYniJ0Rnk (a sketch of the deep Q-learning loss follows).
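    • A hedged PyTorch sketch of the deep Q-learning loss on a minibatch from the replay memory (the small MLPs and the separate target network are assumptions for illustration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))       # stand-in
target_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # stand-in

def dqn_loss(s, a, r, s_next, gamma=0.99):
    """One random minibatch of (s, a, r, s') transitions from the replay memory."""
    q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)     # Q(s, a; theta)
    with torch.no_grad():
        y = r + gamma * target_net(s_next).max(dim=1)[0]  # Bellman target
    return F.mse_loss(q, y)                               # squared Bellman error
```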
  • Policy Gradients
    • The second of the two deep RL algorithms covered here.
    • The problem with Q-function is that the Q-function can be very complicated.
      • Example: a robot grasping an object has a very high-dimensional state.
      • But the policy can be much simpler: just close your hand.
    • Can we learn a policy directly, e.g. finding the best policy from a collection of policies?
    • Define J(theta) = E[ Sum(Y^t * r[t], t >= 0) ; pi_theta ] and do gradient ascent on the policy parameters theta.
    • Converges to a local optimum of J(theta), often good enough!
    • REINFORCE is the algorithm that estimates this gradient to find the best policy.
    • The REINFORCE estimator: grad J(theta) ~= Sum over t of r(tau) * grad log pi(a[t] | s[t]; theta), where r(tau) is the reward of the sampled trajectory.
      • The problem is that this estimator has high variance. Can we fix this?
      • Variance reduction is an active research area!
    • Recurrent Attention Model (RAM) is an algorithm that are based on REINFORCE algorithm and is used for image classification problems:
      • Take a sequence of “glimpses” selectively focusing on regions of the image, to predict class
        • Inspiration from human perception and eye movements.
        • Saves computational resources => scalability
          • For a high-resolution image this saves a lot of computation
        • Able to ignore clutter / irrelevant parts of image
      • RAM is used now in a lot of tasks: including fine-grained image recognition, image captioning, and visual question-answering
    • AlphaGo uses a mix of supervised learning and reinforcement learning, including policy gradients. A REINFORCE sketch follows.
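    • A hedged PyTorch sketch of the REINFORCE loss for one sampled episode (log_probs are assumed to be the log pi(a[t]|s[t]) values collected while acting):

```python
import torch

def reinforce_loss(log_probs, rewards, gamma=0.99):
    """log_probs: list of 0-dim tensors log pi(a_t|s_t); rewards: list of floats r_t."""
    returns, R = [], 0.0
    for r in reversed(rewards):        # discounted return from each time step
        R = r + gamma * R
        returns.insert(0, R)
    returns = torch.tensor(returns)
    # Gradient ascent on expected return, via a negated loss for the optimizer.
    return -(torch.stack(log_probs) * returns).sum()
```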
  • A good course from Stanford on deep reinforcement learning
  • A good course on deep reinforcement learning (2017)
  • A good article

15. Efficient Methods and Hardware for Deep Learning

  • The original lecture was given by Song Han, a PhD candidate at Stanford.
  • Deep Conv nets, Recurrent nets, and deep reinforcement learning are shaping a lot of applications and changing a lot of our lives.
    • Like self driving cars, machine translations, alphaGo and so on.
  • But the current trend says that if we want high accuracy we need larger (deeper) models.
    • The winning ImageNet model size grew 16x from 2012 to 2015 to achieve higher accuracy.
    • Deep Speech 2 takes 10x the training operations of Deep Speech 1, and that’s in only one year! # at Baidu
  • Three challenges follow from this:
    • Model Size
      • It’s hard to deploy larger models on our PCs, mobiles, or cars.
    • Speed
      • ResNet-152 took 1.5 weeks to train to reach its 6.16% error!
      • Long training times limit ML researchers’ productivity.
    • Energy Efficiency
      • AlphaGo: 1920 CPUs and 280 GPUs, a $3000 electric bill per game.
      • If we ran this on a mobile device it would drain the battery.
      • Google mentioned in their blog that if every user used Google speech for 3 minutes, they would have to double their data centers!
      • Where is the energy consumed?
        • Larger model => more memory references => more energy
  • We can improve the Efficiency of Deep Learning by Algorithm-Hardware Co-Design.
    • From both the hardware and the algorithm perspectives.
  • Hardware 101: the Family
    • General Purpose # can run any application
      • CPU # latency oriented, a single strong thread, like a single elephant
      • GPU # throughput oriented, many small threads, like a colony of ants
      • GPGPU
    • Specialized HW # tuned for a domain of applications
      • FPGA # programmable logic, cheaper but less efficient
      • ASIC # fixed logic, designed for a certain application (can be designed for deep learning applications)
  • Hardware 101: Number Representation
    • Numbers in a computer are represented with a finite number of bits.
    • Moving from 32-bit to 16-bit floating-point operations is much cheaper and more energy efficient for the hardware.
  • Part 1: Algorithms for Efficient Inference
    • Pruning neural networks

      • The idea: can we remove some of the weights/neurons such that the NN still behaves the same?

      • In 2015, Han reduced AlexNet from 60 million parameters to 6 million by using pruning!

      • Pruning can be applied to both CNNs and RNNs; applied iteratively, it reaches the same accuracy as the original model.

      • Pruning actually happens in humans:

        • Newborn(50 Trillion Synapses) ==> 1 year old(1000 Trillion Synapses) ==> Adolescent(500 Trillion Synapses)
      • Algorithm:

        1. Get Trained network.
        2. Evaluate importance of neurons.
        3. Remove the least important neuron.
        4. Fine tune the network.
        5. If we need to continue Pruning we go to step 2 again else we stop.
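      • A hedged NumPy sketch of step 3, using weight magnitude as the importance measure (the pruning fraction is an assumption):

```python
import numpy as np

def prune(weights, fraction=0.9):
    """Zero out the `fraction` of weights with the smallest absolute values."""
    threshold = np.quantile(np.abs(weights), fraction)
    mask = np.abs(weights) > threshold
    return weights * mask, mask  # the mask is kept so fine-tuning preserves the zeros
```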
    • Weight Sharing

      • The idea is to reduce the number of distinct values the weights in our models can take.
      • Trained Quantization:
        • Example: all weight values that are 2.09, 2.12, 1.92, 1.87 will be replaced by 2
        • To do that we can run k-means clustering on the weights of a filter, for example, and replace each weight with its cluster centroid. This also reduces the number of distinct values involved in computing gradients. A sketch follows below.
        • After Trained Quantization the Weights are Discrete.
        • Trained Quantization can reduce the number of bits we need for a number in each layer significantly.
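        • A hedged sketch of trained quantization via k-means weight sharing, using scikit-learn (the layer size and number of clusters are assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

w = np.random.randn(256, 256)                    # stand-in weights of one layer
km = KMeans(n_clusters=16, n_init=10).fit(w.reshape(-1, 1))
codebook = km.cluster_centers_.ravel()           # 16 shared weight values
w_quant = codebook[km.labels_].reshape(w.shape)  # each weight -> nearest centroid
# Now only 4-bit indices plus the small codebook need to be stored.
```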
      • Pruning + Trained Quantization can Work Together to reduce the size of the model.
      • Huffman Coding
        • We can use Huffman coding to reduce/compress the number of bits per weight.
        • Infrequent weights: use more bits to represent.
        • Frequent weights: use fewer bits to represent.
      • Using Pruning + Trained Quantization + Huffman Coding together is called deep compression.
      • SqueezeNet
        • All the models we have talked about so far compressed a pretrained model. Can we instead design a new architecture that saves memory and computation?
        • SqueezeNet reaches AlexNet accuracy with 50x fewer parameters and a model size under 0.5 MB.
        • SqueezeNet can be compressed even further by applying deep compression to it.
        • Such models are much more energy efficient and faster.
      • Deep compression was applied in industry by Facebook and Baidu.
    • Quantization

      • Algorithm (Quantizing the Weight and Activation):
        • Train with float.
        • Quantizing the weight and activation:
          • Gather the statistics for weight and activation.
          • Choose proper radix point position.
        • Fine-tune in float format.
        • Convert to fixed-point format.
    • Low Rank Approximation

      • Another size-reduction technique used for CNNs.
      • The idea is to decompose a conv layer into two smaller layers and use those instead.
    • Binary / Ternary Net

      • Can we represent the weights in a NN with only three values?
      • The model size will be much smaller with only -1, 0, 1.
      • This idea was published in 2017: Zhu, Han, Mao, Dally, “Trained Ternary Quantization”, ICLR’17.
      • It works after training.
      • They tried it on AlexNet and reached almost the same error as the original AlexNet.
      • More operations fit per register with such low-precision weights: https://xnor.ai/
    • Winograd Transformation

      • Based on 3x3 Winograd convolutions, which need fewer operations than ordinary convolutions.
      • cuDNN 5 uses Winograd convolutions, which improved speed.
  • Part 2: Hardware for Efficient Inference
    • There are a lot of ASICs developed for deep learning, all of which share the goal of minimizing memory access.
      • Eyeriss MIT
      • DaDiannao
      • TPU Google (Tensor processing unit)
        • It fits into a disk drive slot in the server.
        • Up to 4 cards per server.
        • It consumes far less power than a GPU, and the chip is smaller.
      • EIE Standford
        • By Han et al., 2016 [ISCA’16].
        • It does not store zero weights, and it performs the quantization arithmetic in hardware.
        • Han reports that EIE has better throughput and energy efficiency.
  • Part 3: Algorithms for Efficient Training
    • Parallelization
      • Data Parallel–Run multiple inputs in parallel
        • Ex. Run two images in the same time!
        • Run multiple training examples in parallel.
        • Limited by batch size.
        • Gradients have to be applied by a master node.
      • Model Parallel
        • Split up the Model–i.e. the network
        • Split model over multiple processors By layer.
      • Hyper-Parameter Parallel
        • Try many alternative networks in parallel.
        • Easy to get 16-64 GPUs training one model in parallel.
    • Mixed Precision with FP16 and FP32
      • We have discussed that using 16-bit numbers throughout the model cuts the energy cost by about 4x.
      • Can we run a model entirely in 16 bits? We can partially do this by mixing FP16 and FP32: use 16 bits almost everywhere, but at some points we need FP32.
      • For example, when multiplying FP16 by FP16 we accumulate the product in FP32.
      • Models trained this way can come close to the accuracy of famous models like AlexNet and ResNet.
    • Model Distillation
      • The question: can we use a senior (well-trained) neural network (or an ensemble of them) to guide a student (new) neural network?
      • For more information see Hinton et al., “Dark Knowledge” / “Distilling the Knowledge in a Neural Network”.
    • DSD: Dense-Sparse-Dense Training
      • Han et al. “DSD: Dense-Sparse-Dense Training for Deep Neural Networks”, ICLR 2017
      • It provides better regularization.
      • The idea: train the model (dense), then apply pruning to it (sparse), then re-enable the pruned connections and train them again (dense again).
      • DSD produces the same model architecture but can find a better optimization solution, arrives at a better local minimum, and achieves higher prediction accuracy.
      • This improves performance a lot in many deep learning models.
  • Part 4: Hardware for Efficient Training
    • GPUs for training:
      • Nvidia PASCAL GP100 (2016)
      • Nvidia Volta GV100 (2017)
        • Can make mixed precision operations!
        • So powerful.
        • The new nuclear bomb!
    • Google Announced “Google Cloud TPU” on May 2017!
      • Cloud TPU delivers up to 180 teraflops to train and run machine learning models.
      • One of our new large-scale translation models used to take a full day to train on 32 of the best commercially-available GPUs—now it trains to the same accuracy in an afternoon using just one eighth of a TPU pod.
  • We have moved from PC Era ==> Mobile-First Era ==> AI-First Era

16. Adversarial Examples and Adversarial Training

  • What are adversarial examples?
    • Since 2013, deep neural networks have matched human performance at..
      • Face recognition
      • Object recognition
      • Captcha recognition
        • Because their accuracy became higher than humans’, websites had to look for alternatives to captchas.
      • And other tasks..
    • Before 2013 nobody was surprised when a computer made a mistake! But now that deep learning works so well, it is important to understand its failure modes and their causes.
    • Adversarial examples are unusual, problematic mistakes that deep learning models make.
    • This topic wasn’t hot until deep learning started doing better and better than humans!
    • An adversarial example is an input that has been carefully computed so that it gets misclassified.
    • In a lot of cases the adversarial image isn’t changed much compared to the original image from the human perspective.
    • History of recent papers:
      • Biggio 2013: fool neural nets.
      • Szegedy et al 2013: fool ImageNet classifiers imperceptibly
      • Goodfellow et al 2014: cheap, closed form attack.
    • The first story was in 2013, when Szegedy had a CNN that could classify images very well.
      • He wanted to understand more about how CNNs work in order to improve them.
      • He took an image of an object and, using gradient ascent, updated the image so that it would be classified as another object.
      • Strangely, he found that the resulting image had barely changed from a human perspective!
      • If you tried it you wouldn’t notice any change and would think it is a bug! But it isn’t: if you compare the pixel values, the two images really are different.
    • These mistakes can be found in almost any deep learning algorithm we have studied!
      • It turns out that RBF (radial basis function) networks can resist this.
      • Deep Models for Density Estimation can resist this.
    • It is not just neural nets that can be fooled:
      • Linear models
        • Logistic regression
        • Softmax regression
        • SVMs
      • Decision trees
      • Nearest neighbors
  • Why do adversarial happen?
    • While trying to understand what was happening, the first thought (around 2016) was that adversarial examples come from overfitting in the high-dimensional setting.
      • Because in such high dimensions the model could have random errors that an attacker can find.
      • If that were true, a model trained with different parameters should not make the same mistakes.
      • They found that is not right: different models make the same mistakes, so it is not overfitting.
    • These experiments showed that the problem is caused by something systematic, not random.
      • If you add a certain vector to an example, it gets misclassified by many different models.
    • So maybe adversarial examples come from underfitting, not overfitting.
    • Modern deep nets are very piecewise linear
      • Rectified linear unit
      • Carefully tuned sigmoids # most of the time we stay inside the linear part of the curve
      • Maxout
      • LSTM
    • The relation between the parameters and the output is nonlinear (parameters get multiplied together), which is what makes training NNs difficult; the mapping from input to output, however, is close to (piecewise) linear and much easier to exploit.
  • How can adversarial be used to compromise machine learning systems?
    • If we are testing how easy a NN is to fool, we want to make sure we are actually fooling it with a tiny perturbation, not just changing the output class with a large change; and if we are attackers, we want to reliably force this behavior (find a hole).

    • When we build adversarial examples we use a max-norm constraint on the perturbation.

    • The fast gradient sign method:

      • This method follows from the fact that almost all NNs use piecewise-linear activations (like ReLU), the assumption mentioned above.
      • No pixel can be changed by more than some amount epsilon.
      • The fast way: take the gradient of the cost used to train the network with respect to the input, take the sign of that gradient, and multiply it by epsilon.
      • Equation:
        • x' = x + epsilon * sign(gradient of the cost with respect to x)
        • Where x' is the adversarial example and x is the original example.
      • So the attack only needs the direction (sign) of the gradient and some epsilon; a sketch follows.
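    • A hedged PyTorch sketch of the fast gradient sign method (the linear classifier is a stand-in for the attacked model):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))  # stand-in classifier

def fgsm(model, x, y, epsilon=0.03):
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)            # the cost used to train the network
    loss.backward()
    return (x + epsilon * x.grad.sign()).detach()  # x' = x + epsilon * sign(gradient)

x_adv = fgsm(model, torch.rand(1, 3, 32, 32), torch.tensor([3]))
```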
    • Some attacks are based on ADAM optimizer.

    • Adversarial examples are not random noise!

    • NNs are trained on some distribution and behave well on that distribution. But if you shift the distribution, the NN won’t give the right answers; it becomes very easy to fool.

    • Deep RL can also be fooled.

    • Attack of the weights:

      • In linear models, we can take the learned weight image, take its sign, and add it to any example to force the class corresponding to those weights. See Andrej Karpathy, “Breaking Linear Classifiers on ImageNet”.
    • It turns out that some models do resist (we can’t easily build adversarial examples against them):

      • In particular, shallow RBF networks resist adversarial perturbations # from the fast gradient sign method
        • The problem is that RBFs don’t reach high accuracy on these datasets because they are shallow models, and if you try to make them deeper, the gradients go to zero in almost all layers.
        • RBFs are very difficult to train, even with batch normalization.
        • Ian thinks that with better hyperparameters or a better optimization algorithm than gradient descent we could train deep RBFs and solve the adversarial problem!
    • We can also use one model to fool another, e.g. use an SVM to craft examples that fool a deep NN.

      • For more details follow the paper: “Papernot 2016”
    • Transferability Attack

      1. Target model with unknown weights, machine learning algorithm, and training set; maybe non-differentiable.
      2. Build your own training set from this model: send it your inputs and record its outputs.
      3. Train your own model. “Following a table from Papernot 2016”
      4. Create adversarial examples against your model.
      5. Use those examples against the model you are targeting.
      6. You are very likely to fool the target!
    • In a transferability attack, to push your probability of fooling a network towards 100%, you can train more than one model, maybe five, and use adversarial examples that fool all of them. “(Liu et al, 2016)”

    • Adversarial examples work on the human brain too! For example, optical illusions that trick your eyes; there are a lot of them on the internet.

    • In practice, researchers have fooled real deployed models from MetaMind, Amazon, and Google.

    • Someone uploaded a perturbed image to Facebook and Facebook was fooled :D

  • What are the defenses?
    • A lot of defenses Ian tried failed really badly, including:
      • Ensembles
      • Weight decay
      • Dropout
      • Adding noise at train time or at test time
      • Removing perturbation with an autoencoder
      • Generative modeling
    • Universal approximator theorem
      • Whatever shape we would like our classification function to have, a big enough NN can represent it.
      • So in principle we could train a NN that detects adversarial examples!
    • Linear models and KNN are fooled more easily than neural nets. Neural nets can actually become more secure than other models: adversarially trained neural nets have the best empirical success rate against adversarial examples of any machine learning model.
      • Deep NNs could be trained with more nonlinear functions, but that would need a better optimization technique than the ones that currently push us toward piecewise-linear activations like ReLU.
  • How to use adversarial examples to improve machine learning, even when there is no adversary?
    • Universal engineering machine (model-based optimization) # Ian’s name for it
      • For example:
        • Imagine that we want to design a car that is fast.
        • We train a NN to look at the blueprint of a car and tell us whether the blueprint gives a fast car.
        • The idea is then to optimize the input to the network so that its predicted output is maximized; this could give us the best possible blueprint for a car!
      • Make new inventions by finding inputs that maximize the model’s predicted performance.
      • Right now this procedure just gives us adversarial inputs we don’t like, but if we solve that problem we could get the fastest car, the best GPU, the best chair, new drugs…
    • Adversarial examples are an active area of research, especially defending the networks!
  • Conclusion
    • Attacking is easy
    • Defending is difficult
    • Adversarial training provides regularization and semi-supervised learning
    • The out-of-domain input problem is a bottleneck for model-based optimization generally
  • There is GitHub code (built on top of TensorFlow) that lets you learn everything about adversarial examples hands-on.





These notes were made by Mahmoud Badry @2017

3 - Internet of Things Course

Internet of Things Course

  • Code: TKE194945 Internet of Things
  • Credits: 3 SKS
  • Schedule
    • Class A: Room E-205, Thursday 13.00, 12 students, 1 MBKM student

4 - Digital Signal Processing Course

Digital Signal Processing Course

  • Code: TKE192227 Pengolahan Sinyal Digital
  • Credits: 3 SKS
  • Schedule:
    • Class B: Room E-101, Wednesday 07.00, 35 students
    • Class A: Room E-101, Wednesday 09.45, 50 students

Course Identity

  • Course code: TKE192227
  • Credits: 3 SKS
  • Semester: 4
  • Course type: Core Electrical Engineering (TEI)

Materials

  1. Meeting 1
  2. Meeting 2
  3. Meeting 3
  4. Meeting 4
  5. Meeting 5
  6. Meeting 6
  7. Meeting 7

References

  • Li Tan, Digital Signal Processing

5 - Intelligent Control Systems Course

Intelligent Control Systems Course

Course Identity

  • Code: TKE194941 Sistem Kendali Cerdas
  • Credits: 3 SKS
  • Schedule:
    • Class A: Room E-201, Friday 13.55, 3 students
  • Method: Case-based and Project-based Learning
  • Semester: 6
  • Course type: Electrical Engineering Specialization (TED)

Materials

  1. Introduction
  2. Fundamentals of Fuzzy Logic
  3. Fuzzy Inference Systems
  4. Fuzzy Inference Systems for Control
  5. Fuzzy Inference System Project for Control
  6. Fuzzy Inference System Project for Control
  7. Introduction to Neural Networks
  8. Neural Networks in Control Systems
  9. Neural Networks in Control Systems
  10. Neural Networks in Control Systems
  11. Neuro-Fuzzy Systems
  12. Neuro-Fuzzy Systems for Control
  13. Neuro-Fuzzy System Project for Control
  14. Neuro-Fuzzy System Project for Control

Main References

Additional References

Neuro-fuzzy in Python

Libraries

  • numpy conda install -c conda-forge numpy, pip install numpy
  • scipy conda install -c conda-forge scipy, pip install scipy
  • scikit fuzzy conda install -c conda-forge scikit-fuzzy, pip install scikit-fuzzy
  • scikit learn conda install -c conda-forge scikit-learn, pip install scikit-learn
  • fuzzylite pip install pyfuzzylite
  • pandas conda install -c conda-forge pandas, pip install pandas
  • statsmodels conda install -c conda-forge statsmodels,pip install statsmodels
  • keras conda install -c conda-forge keras, pip install keras
  • anfis pip install anfis
  • bokeh conda install -c conda-forge bokeh, pip install bokeh
  • fuzzycmeans pip install fuzzycmeans
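A minimal sketch using scikit-fuzzy from the list above (the universe and membership function parameters are arbitrary example values):

```python
import numpy as np
import skfuzzy as fuzz

x = np.arange(0, 11, 1)                        # universe of discourse
cold = fuzz.trimf(x, [0, 0, 5])                # triangular membership function
degree = fuzz.interp_membership(x, cold, 3.2)  # membership degree of a crisp input
print(degree)
```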

Downgrade Python for installing keras and tensorflow

  • python --version
  • conda search python : list the python versions available for installation
  • conda install python=3.6.0: downgrade to your preferred python

6 - Control Systems Course

Control Systems Course

Course Identity

  • Course code: TKE192221
  • Credits: 2 SKS
  • Semester: 4
  • Course type: Core Electrical Engineering (TEI)
  • Schedule:
    • Class A: Room C-201, Tuesday 07.55, 45 students
    • Class B: Room C-201, Tuesday 10.40, 47 students

References

Software

Interactive Learning

Video

Lectures

01 - Introduction to Control Systems

Interactive Course for Control Theory

  • Go to the Interactive Course for Control Theory site
  • Create an ICCT account, then check your email for your username and password
  • Log in to the Interactive Course for Control Theory
  • From there you will interact with Jupyter Notebooks in ICCT
  • Click the ICCT folder in the Jupyter Notebook, then click Table-of-Contents-ICCT.ipynb
  • Click one of the links, for example 1.1.1 Complex Numbers in Cartesian Form in the 1.1 Complex Numbers folder
  • That link puts you in the Jupyter Notebook M-01_Complex_numbers_Cartesian_form.ipynb
  • Don’t panic at the Python code that appears.
  • Select the appropriate menu item to run the notebook.
  • Read the notebook and understand its explanations and assignments.
  • Then interactively change the various controls inside the notebook.
  • You can also download or take screenshots of the figures.
  • When you are done, select the menu item that shuts down the Jupyter Notebook. Make a habit of doing this every time you finish working with a Jupyter Notebook.

Meeting 2

Meeting 3

Meeting 4

Meeting 5

Meeting 6

Meeting 7

Midterm Exam (UTS)

Meeting 8

Meeting 9

Meeting 10

Meeting 11

Meeting 12

Meeting 13

Meeting 14

Final Exam (UAS)

7 - Electric Circuits

Electric Circuits

AC Circuits (steady state)

$$ X_L=\omega L $$
$$ Z_L=jX_L=j\omega L=\omega L\angle 90^{\circ} $$
$$ X_C= \frac{1}{\omega C} $$
$$ Z_C=-jX_C=\frac{-j}{\omega C}=\frac{1}{j\omega C}=\frac{1}{\omega C}\angle -90^{\circ} $$

DC Circuits (steady state)

In DC, L is a short circuit and C is an open circuit. This follows from $\omega = 2 \pi f$ with $f=0$: then $X_L = 0$ and $X_C \to \infty$.
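A small NumPy sketch checking these impedances numerically (the component values are arbitrary examples):

```python
import numpy as np

f, L, C = 50.0, 0.1, 1e-6          # example values: Hz, henry, farad
w = 2 * np.pi * f
Z_L = 1j * w * L                   # inductor: magnitude wL, angle +90 degrees
Z_C = 1 / (1j * w * C)             # capacitor: magnitude 1/(wC), angle -90 degrees
print(abs(Z_L), np.angle(Z_L, deg=True))
print(abs(Z_C), np.angle(Z_C, deg=True))
```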

8 - Embedded Systems

Embedded Systems

Free Book

  • F(E) Foundations of Embedded Systems

Free Course

9 - Machine Learning Andrew Ng Quizzes

Machine Learning Andrew Ng Quizzes

Week 1

Introduction

  1. A computer program is said to learn from experience E with respect to some task T and some performance measure P if its performance on T, as measured by P, improves with experience E. Suppose we feed a learning algorithm a lot of historical weather data, and have it learn to predict weather. What would be a reasonable choice for P?
    • 🗹 The probability of it correctly predicting a future date’s weather.
    • ☐ The weather prediction task.
    • ☐ The process of the algorithm examining a large amount of historical weather data.
    • ☐ None of these.
  2. A computer program is said to learn from experience E with respect to some task T and some performance measure P if its performance on T, as measured by P, improves with experience E. Suppose we feed a learning algorithm a lot of historical weather data, and have it learn to predict weather. In this setting, what is T?
    • 🗹 The weather prediction task.
    • ☐ None of these.
    • ☐ The probability of it correctly predicting a future date’s weather.
    • ☐ The process of the algorithm examining a large amount of historical weather data.
  3. Suppose you are working on weather prediction, and use a learning algorithm to predict tomorrow’s temperature (in degrees Centigrade/Fahrenheit).
    Would you treat this as a classification or a regression problem?
    • 🗹 Regression
    • ☐ Classification
  4. Suppose you are working on weather prediction, and your weather station makes one of three predictions for each day’s weather: Sunny, Cloudy or Rainy. You’d like to use a learning algorithm to predict tomorrow’s weather.
    Would you treat this as a classification or a regression problem?
    • ☐ Regression
    • 🗹 Classification
  5. Suppose you are working on stock market prediction, and you would like to predict the price of a particular stock tomorrow (measured in dollars). You want to use a learning algorithm for this.
    Would you treat this as a classification or a regression problem?
    • 🗹 Regression
    • ☐ Classification
  6. Suppose you are working on stock market prediction. You would like to predict whether or not a certain company will declare bankruptcy within the next 7 days (by training on data of similar companies that had previously been at risk of bankruptcy).
    Would you treat this as a classification or a regression problem?
    • ☐ Regression
    • 🗹 Classification
  7. Suppose you are working on stock market prediction, Typically tens of millions of shares of Microsoft stock are traded (i.e., bought/sold) each day. You would like to predict the number of Microsoft shares that will be traded tomorrow.
    Would you treat this as a classification or a regression problem?
    • 🗹 Regression
    • ☐ Classification
  8. Some of the problems below are best addressed using a supervised learning algorithm, and the others with an unsupervised learning algorithm. Which of the following would you apply supervised learning to? (Select all that apply.) In each case, assume some appropriate dataset is available for your algorithm to learn from.
    • 🗹 Given historical data of children’s ages and heights, predict children’s height as a function of their age.
    • 🗹 Given 50 articles written by male authors, and 50 articles written by female authors, learn to predict the gender of a new manuscript’s author (when the identity of this author is unknown).
    • ☐ Take a collection of 1000 essays written on the US Economy, and find a way to automatically group these essays into a small number of groups of essays that are somehow “similar” or “related”.
    • ☐ Examine a large collection of emails that are known to be spam email, to discover if there are sub-types of spam mail.
  9. Some of the problems below are best addressed using a supervised learning algorithm, and the others with an unsupervised learning algorithm. Which of the following would you apply supervised learning to? (Select all that apply.) In each case, assume some appropriate dataset is available for your algorithm to learn from.
    • ☐ Given data on how 1000 medical patients respond to an experimental drug (such as effectiveness of the treatment, side effects, etc.), discover whether there are different categories or “types” of patients in terms of how they respond to the drug, and if so what these categories are.
    • ☐ Given a large dataset of medical records from patients suffering from heart disease, try to learn whether there might be different clusters of such patients for which we might tailor separate treatments.
    • 🗹 Have a computer examine an audio clip of a piece of music, and classify whether or not there are vocals (i.e., a human voice singing) in that audio clip, or if it is a clip of only musical instruments (and no vocals).
    • 🗹 Given genetic (DNA) data from a person, predict the odds of him/her developing diabetes over the next 10 years.
  10. Some of the problems below are best addressed using a supervised learning algorithm, and the others with an unsupervised learning algorithm. Which of the following would you apply supervised learning to? (Select all that apply.) In each case, assume some appropriate dataset is available for your algorithm to learn from.
    • ☐ Take a collection of 1000 essays written on the US Economy, and find a way to automatically group these essays into a small number of groups of essays that are somehow “similar” or “related”.
    • 🗹 Given genetic (DNA) data from a person, predict the odds of him/her developing diabetes over the next 10 years.
    • ☐ Examine a large collection of emails that are known to be spam email, to discover if there are sub-types of spam mail.
    • 🗹 Examine the statistics of two football teams, and predict which team will win tomorrow’s match (given historical data of teams’ wins/losses to learn from).
  11. Which of these is a reasonable definition of machine learning?
    • ☐ Machine learning is the science of programming computers.
    • ☐ Machine learning learns from labeled data.
    • ☐ Machine learning is the field of allowing robots to act intelligently.
    • 🗹 Machine learning is the field of study that gives computers the ability to learn without being explicitly programmed.

Linear Regression with One Variable :

  1. Consider the problem of predicting how well a student does in her second year of college/university, given how well she did in her first year. Specifically, let x be equal to the number of “A” grades (including A-. A and A+ grades) that a student receives in their first year of college (freshmen year). We would like to predict the value of y, which we define as the number of “A” grades they get in their second year (sophomore year).
    Here each row is one training example. Recall that in linear regression, our hypothesis is $h_\theta(x)=\theta_0+\theta_1x$, and we use $m$ to denote the number of training examples.
    [training set table]
    For the training set given above (note that this training set may also be referenced in other questions in this quiz), what is $m$? In the box below, please enter your answer (which should be a number between 0 and 10).

    4 
    
  2. Many substances that can burn (such as gasoline and alcohol) have a chemical structure based on carbon atoms; for this reason they are called hydrocarbons. A chemist wants to understand how the number of carbon atoms in a molecule affects how much energy is released when that molecule combusts (meaning that it is burned). The chemist obtains the dataset below. In the column on the right, “kJ/mol” is the unit measuring the amount of energy released.

    (dataset table omitted in the source)

    You would like to use linear regression $h_\theta(x) = \theta_0 + \theta_1x$ to estimate the amount of energy released (y) as a function of the number of carbon atoms (x). Which of the following do you think will be the values you obtain for $\theta_0$ and $\theta_1$ ? You should be able to select the right answer without actually implementing linear regression.

    • ☐ $\theta_0$ = −569.6, $\theta_1$ = 530.9
    • ☐ $\theta_0$ = −1780.0, $\theta_1$ = −530.9
    • 🗹 $\theta_0$ = −569.6, $\theta_1$ = −530.9
    • ☐ $\theta_0$ = −1780.0, $\theta_1$ = 530.9
  3. For this question, assume that we are using the training set from Q1.
    Recall our definition of the cost function was $J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$.
    What is $J(0,1)$? In the box below,
    please enter your answer (simplify fractions to decimals, using ‘.’ as the decimal delimiter, e.g., 1.5).

    0.5
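
    A minimal Octave sketch of this computation, for concreteness. The training set itself is an assumption (its table image is missing from the source), chosen to be consistent with the stated answers $m = 4$ and $J(0,1) = 0.5$:

    x = [3; 1; 0; 4];   % assumed training-set inputs (table missing in source)
    y = [2; 2; 1; 3];   % assumed targets, consistent with m = 4 and J(0,1) = 0.5
    m = length(x);
    h = 0 + 1 * x;                    % hypothesis with theta0 = 0, theta1 = 1
    J = sum((h - y) .^ 2) / (2 * m)   % squared-error cost -> 0.5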
    
  4. Suppose we set $\theta_0 = 0, \theta_1 = 1.5$ in the linear regression hypothesis from Q1. What is $h_\theta(2)$ ?

    3
    
  5. Suppose we set $\theta_0 = -2, \theta_1 = 0.5$ in the linear regression hypothesis from Q1. What is $h_\theta(6)$?

    1
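
    Both this answer and the previous one are direct evaluations of the hypothesis; a one-line Octave check:

    h = @(theta0, theta1, x) theta0 + theta1 * x;   % linear hypothesis
    h(0, 1.5, 2)    % -> 3  (Q4)
    h(-2, 0.5, 6)   % -> 1  (Q5)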
    
  6. Let $f$ be some function so that $f(\theta_0 , \theta_1 )$ outputs a number. For this problem, f is some arbitrary/unknown smooth function (not necessarily the cost function of linear regression, so f may have local optima).
    Suppose we use gradient descent to try to minimize $f(\theta_0 , \theta_1 )$ as a function of $\theta_0$ and $\theta_1$.
    Which of the following statements are true? (Check all that apply.)

    • 🗹 If $\theta_0$ and $\theta_1$ are initialized at the global minimum, then one iteration will not change their values.
    • ☐ Setting the learning rate $\alpha$ to be very small is not harmful, and can only speed up the convergence of gradient descent.
    • ☐ No matter how $\theta_0$ and $\theta_1$ are initialized, so long as $\alpha$ is sufficiently small, we can safely expect gradient descent to converge to the same solution.
    • 🗹 If the first few iterations of gradient descent cause $f(\theta_0 , \theta_1)$ to increase rather than decrease, then the most likely cause is that we have set the learning rate $\alpha$ to too large a value.
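
    A minimal sketch of the learning-rate statements above, on the toy function $f(\theta) = \theta^2$ (an assumed example, not from the quiz): a small $\alpha$ converges, while a too-large $\alpha$ makes $f$ increase from iteration to iteration.

    grad = @(t) 2 * t;             % gradient of f(t) = t^2
    for alpha = [0.1, 1.1]         % safe vs. too-large learning rate
      t = 1;
      for k = 1:5
        t = t - alpha * grad(t);   % gradient descent step
      end
      printf('alpha = %.1f -> f = %.4f\n', alpha, t^2);
    end
    % alpha = 0.1 shrinks f toward 0; alpha = 1.1 makes f grow every step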
  7. In the given figure, the cost function $J(\theta_0, \theta_1)$ has been plotted against $\theta_0$ and $\theta_1$, as shown in ‘Plot 2’. The contour plot for the same cost function is given in ‘Plot 1’. Based on the figure, choose the correct options (check all that apply).
    Plots for Cost Function

    • ☐ If we start from point B, gradient descent with a well-chosen learning rate will eventually help us reach at or near point A, as the value of cost function $J(\theta_0, \theta_1)$ is maximum at point A.
    • ☐ If we start from point B, gradient descent with a well-chosen learning rate will eventually help us reach at or near point C, as the value of cost function $J(\theta_0, \theta_1)$ is minimum at point C.
    • 🗹 Point P (the global minimum of plot 2) corresponds to point A of Plot 1.
    • 🗹 If we start from point B, gradient descent with a well-chosen learning rate will eventually help us reach at or near point A, as the value of cost function $J(\theta_0, \theta_1)$ is minimum at A.
    • ☐ Point P (The global minimum of plot 2) corresponds to point C of Plot 1.
  8. Suppose that for some linear regression problem (say, predicting housing prices as in the lecture), we have some training set, and for our training set we managed to find some $\theta_0, \theta_1$, such that $J(\theta_0 , \theta_1) = 0$.
    Which of the statements below must then be true? (Check all that apply.)

    • ☐ Gradient descent is likely to get stuck at a local minimum and fail to find the global minimum.
    • ☐ For this to be true, we must have $\theta_0 = 0$ and $\theta_1 = 0$
      so that $h_{\theta}(x) = 0$
    • ☐ For this to be true, we must have $y^{(i)} = 0$ for every value of $i = 1, 2,…,m$.
    • 🗹 Our training set can be fit perfectly by a straight line, i.e., all of our training examples lie perfectly on some straight line.

Week 4

Logistic Regression :

  1. Suppose that you have trained a logistic regression classifier, and it outputs on a new example a prediction $h_\theta(x) = 0.2$. This means (check all that apply):
    • ☐ Our estimate for P(y = 1|x; θ) is 0.8.
    • 🗹 Our estimate for P(y = 0|x; θ) is 0.8.
    • 🗹 Our estimate for P(y = 1|x; θ) is 0.2.
    • ☐ Our estimate for P(y = 0|x; θ) is 0.2.
  2. Suppose you have the following training set, and fit a logistic regression classifier $h_\theta(x) = g(\theta_0 + \theta_1x_1 + \theta_2x_2)$.
    Which of the following are true? Check all that apply.

    • 🗹 Adding polynomial features (e.g., instead using $h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_1 x_2 + \theta_5 x_2^2 ))$ could increase how well we can fit the training data.
    • 🗹 At the optimal value of θ (e.g., found by fminunc), we will have $J(θ) ≥ 0$.
    • ☐ Adding polynomial features (e.g., instead using $h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_1 x_2 + \theta_5 x_2^2 ))$ would increase $J(θ)$ because we are now summing over more terms.
    • ☐ If we train gradient descent for enough iterations, for some examples $x^{(i)}$ in the training set it is possible to obtain $h_\theta(x^{(i)} ) > 1$.
  3. For logistic regression, the gradient is given by $\frac{\partial }{\partial \theta_j } J(\theta) = \frac{1}{m} \sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})x^{(i)}_j$. Which of these is a correct gradient descent update for logistic regression with a learning rate of $\alpha$ ? Check all that apply.
    • 🗹 $\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)})-y^{(i)}) x^{(i)}_j$ (simultaneously update for all j).
    • ☐ $\theta := \theta - \alpha \frac{1}{m} \sum_{i=1}^m (\theta^Tx-y^{(i)}) x^{(i)}$.
    • 🗹 $\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^m \left(\frac{1}{1+e^{-\theta^Tx^{(i)}}}-y^{(i)}\right) x^{(i)}_j$ (simultaneously update for all j).
    • ☐ $\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)})-y^{(i)}) x^{(i)}$ (simultaneously update for all j).
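
    A minimal vectorized Octave sketch of the checked update; the toy data and the inline sigmoid definition are assumptions for illustration:

    sigmoid = @(z) 1 ./ (1 + exp(-z));
    X = [1 0.5; 1 -1.2; 1 2.0];    % m x (n+1) design matrix, bias column first
    y = [1; 0; 1];                 % labels
    theta = zeros(2, 1);  alpha = 0.1;  m = rows(X);
    h = sigmoid(X * theta);                        % h_theta(x^(i)) for every i
    theta = theta - (alpha / m) * (X' * (h - y))   % simultaneous update of all theta_j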
  4. Which of the following statements are true? Check all that apply.
    • 🗹 The one-vs-all technique allows you to use logistic regression for problems in which each $y^{(i)}$ comes from a fixed, discrete set of values.
    • ☐ For logistic regression, sometimes gradient descent will converge to a local minimum (and fail to find the global minimum). This is the reason we prefer more advanced optimization algorithms such as fminunc (conjugate gradient/BFGS/L-BFGS/etc).
    • 🗹 The cost function $J(\theta)$ for logistic regression trained with $m \geq 1$ examples is always greater than or equal to zero.
    • ☐ Since we train one classifier when there are two classes, we train two classifiers when there are three classes (and we do one-vs-all classification).
  5. Suppose you train a logistic classifier $h_\theta(x) = g(\theta_0 + \theta_1x_1 + \theta_2x_2)$. Suppose $\theta_0 = 6$, $\theta_1 = -1$, $\theta_2 = 0$. Which of the following figures represents the decision boundary found by your classifier? (With these parameters $h_\theta(x) = g(6 - x_1)$, so the boundary is the vertical line $x_1 = 6$, with $y = 1$ predicted for $x_1 \leq 6$; the figure images are missing from the source.)
    • 🗹 Figure:
    • ☐ Figure:
    • ☐ Figure:
    • ☐ Figure:

Regularization

  1. You are training a classification model with logistic regression. Which of the following statements are true? Check all that apply.
    • ☐ Introducing regularization to the model always results in equal or better performance on the training set.
    • ☐ Introducing regularization to the model always results in equal or better performance on examples not in the training set.
    • 🗹 Adding a new feature to the model always results in equal or better performance on the training set.
    • ☐ Adding many new features to the model helps prevent overfitting on the training set.
  2. Suppose you ran logistic regression twice, once with $\lambda = 0$, and once with $\lambda = 1$. One of the times, you got parameters $\theta = \begin{bmatrix} 74.81 \\ 45.05 \end{bmatrix}$, and the other time you got $\theta = \begin{bmatrix} 1.37 \\ 0.51 \end{bmatrix}$. However, you forgot which value of $\lambda$ corresponds to which value of $\theta$. Which one do you think corresponds to $\lambda = 1$?
    • 🗹 $\theta = \begin{bmatrix} 1.37 \\ 0.51 \end{bmatrix}$
    • ☐ $\theta = \begin{bmatrix} 74.81 \\ 45.05 \end{bmatrix}$
  3. Which of the following statements about regularization are true? Check all that apply.
    • ☐ Using a very large value of $\lambda$ cannot hurt the performance of your hypothesis; the only reason we do not set $\lambda$ to be too large is to avoid numerical problems.
    • ☐ Because logistic regression outputs values $0 \leq h_\theta(x) \leq 1$, its range of output values can only be “shrunk” slightly by regularization anyway, so regularization is generally not helpful for it.
    • 🗹 Consider a classification problem. Adding regularization may cause your classifier to incorrectly classify some training examples (which it had correctly classified when not using regularization, i.e. when $\lambda = 0$).
    • ☐ Using too large a value of $\lambda$ can cause your hypothesis to overfit the data; this can be avoided by reducing $\lambda$.
  4. Which of the following statements about regularization are true? Check all that apply.
    • ☐ Using a very large value of $\lambda$ cannot hurt the performance of your hypothesis; the only reason we do not set $\lambda$ to be too large is to avoid numerical problems.
    • ☐ Because logistic regression outputs values $0 \leq h_\theta(x) \leq 1$, its range of output values can only be “shrunk” slightly by regularization anyway, so regularization is generally not helpful for it.
    • ☐ Because regularization causes $J(\theta)$ to no longer be convex, gradient descent may not always converge to the global minimum (when $\lambda > 0$, and when using an appropriate learning rate $\alpha$).
    • 🗹 Using too large a value of $\lambda$ can cause your hypothesis to underfit the data; this can be avoided by reducing $\lambda$.
  5. In which one of the following figures do you think the hypothesis has overfit the training set? (The figure images are missing from the source.)
    • 🗹 Figure:
    • ☐ Figure:
    • ☐ Figure:
    • ☐ Figure:
  6. In which one of the following figures do you think the hypothesis has underfit the training set? (The figure images are missing from the source.)
    • 🗹 Figure:
    • ☐ Figure:
    • ☐ Figure:
    • ☐ Figure:

Week 5

Neural Networks - Representation :

  1. Which of the following statements are true? Check all that apply.
    • 🗹 Any logical function over binary-valued (0 or 1) inputs $x_1$ and $x_2$ can be (approximately) represented using some neural network.
    • ☐ Suppose you have a multi-class classification problem with three classes, trained with a 3 layer network. Let $a^{(3)}_1 = (h_\theta(x))_1$ be the activation of the first output unit, and similarly $a^{(3)}_2 = (h_\theta(x))_2$ and $a^{(3)}_3 = (h_\theta(x))_3$. Then for any input x, it must be the case that $a^{(3)}_1 + a^{(3)}_2 + a^{(3)}_3 = 1$.
    • ☐ A two layer (one input layer, one output layer; no hidden layer) neural network can represent the XOR function.
    • 🗹 The activation values of the hidden units in a neural network, with the sigmoid activation function applied at every layer, are always in the range (0, 1).
  2. Consider the following neural network which takes two binary-valued inputs
    $x_1, x_2 \in \{0,1\}$ and outputs $h_\theta(x)$. Which of the following logical functions does it (approximately) compute?
    • 🗹 AND
    • ☐ NAND (meaning “NOT AND”)
    • ☐ OR
    • ☐ XOR (exclusive OR)
  3. Consider the following neural network which takes two binary-valued inputs
    $x_1, x_2 \in \{0,1\}$ and outputs $h_\theta(x)$. Which of the following logical functions does it (approximately) compute?
    • ☐ AND
    • ☐ NAND (meaning “NOT AND”)
    • 🗹 OR
    • ☐ XOR (exclusive OR)
  4. Consider the neural network given below. Which of the following equations correctly computes the activation $a_1^{(3)}$? Note: $g(z)$ is the sigmoid activation function.
    • 🗹 $a_1^{(3)} = g(\theta_{1,0}^{(2)}a_0^{(2)}+\theta_{1,1}^{(2)}a_1^{(2)}+\theta_{1,2}^{(2)}a_2^{(2)})$
    • ☐ $a_1^{(3)} = g(\theta_{1,0}^{(2)}a_0^{(1)}+\theta_{1,1}^{(2)}a_1^{(1)}+\theta_{1,2}^{(2)}a_2^{(1)})$
    • ☐ $a_1^{(3)} = g(\theta_{1,0}^{(1)}a_0^{(2)}+\theta_{1,1}^{(1)}a_1^{(2)}+\theta_{1,2}^{(1)}a_2^{(2)})$
    • ☐ $a_1^{(3)} = g(\theta_{2,0}^{(2)}a_0^{(2)}+\theta_{2,1}^{(2)}a_1^{(2)}+\theta_{2,2}^{(2)}a_2^{(2)})$
  5. You have the following neural network:

    You’d like to compute the activations of the hidden layer $a^{(2)} \in \mathbb{R}^3$. One way to do
    so is with the Octave code below (the snippet is omitted in the source).

    You want to have a vectorized implementation of this (i.e., one that does not use for loops). Which of the following implementations correctly compute $a^{(2)}$? Check all
    that apply.
    • 🗹 z = Theta1 * x; a2 = sigmoid (z);
    • ☐ a2 = sigmoid (x * Theta1);
    • ☐ a2 = sigmoid (Theta2 * x);
    • ☐ z = sigmoid(x); a2 = sigmoid (Theta1 * z);
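
    The omitted snippet was a for-loop version of this computation; a hedged reconstruction together with the checked vectorized form (the weights, input, and sigmoid helper below are illustrative assumptions; x includes the bias unit):

    sigmoid = @(z) 1 ./ (1 + exp(-z));   % assumed helper
    Theta1 = [1 2 3; 4 5 6; 7 8 9];      % illustrative 3x3 weight matrix
    x = [1; 0.5; -0.5];                  % input vector, bias unit first
    for i = 1:3                          % loop reconstruction of the omitted code
      z(i) = Theta1(i, :) * x;           % inner product of row i of Theta1 with x
      a2(i) = sigmoid(z(i));
    end
    a2v = sigmoid(Theta1 * x);           % the checked vectorized answer (same values)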
  6. You are using the neural network pictured below and have learned the parameters $\theta^{(1)} = \begin{bmatrix} 1 & 1 & 2.4 \\ 1 & 1.7 & 3.2 \end{bmatrix}$ (used to compute $a^{(2)}$) and $\theta^{(2)} = \begin{bmatrix} 1 & 0.3 & -1.2 \end{bmatrix}$ (used to compute $a^{(3)}$ as a function of $a^{(2)}$). Suppose you swap the parameters for the first hidden layer between its two units so $\theta^{(1)} = \begin{bmatrix} 1 & 1.7 & 3.2 \\ 1 & 1 & 2.4 \end{bmatrix}$ and also swap the output layer so $\theta^{(2)} = \begin{bmatrix} 1 & -1.2 & 0.3 \end{bmatrix}$. How will this change the value of the output $h_\theta(x)$?
    • 🗹 It will stay the same.
    • ☐ It will increase.
    • ☐ It will decrease.
    • ☐ Insufficient information to tell: it may increase or decrease.

Neural Networks: Learning :

  1. You are training a three layer neural network and would like to use backpropagation to compute the gradient of the cost function. In the backpropagation algorithm, one of the steps is to update $\Delta_{ij}^{(2)} := \Delta_{ij}^{(2)} + \delta_i^{(3)} * (a^{(2)})_j$
    for every i,j. Which of the following is a correct vectorization of this step?
    • ☐ $\Delta^{(2)} := \Delta^{(2)} + \delta^{(2)} * (a^{(3)})^T$
    • ☐ $\Delta^{(2)} := \Delta^{(2)} + (a^{(2)})^T * \delta^{(3)}$
    • ☐ $\Delta^{(2)} := \Delta^{(2)} + (a^{(2)})^T * \delta^{(2)}$
    • 🗹 $\Delta^{(2)} := \Delta^{(2)} + \delta^{(3)} * (a^{(2)})^T$
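
    A quick shape check of the checked vectorization; the layer sizes here are assumptions for illustration:

    delta3 = [0.1; -0.2; 0.3; 0.05];   % errors of 4 output units (illustrative)
    a2 = [1; 0.7; 0.2; 0.9; 0.4];      % activations of 5 hidden units (incl. bias)
    Delta2 = zeros(4, 5);              % gradient accumulator
    Delta2 = Delta2 + delta3 * a2';    % one outer product covers every (i, j)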
  2. Suppose Theta1 is a 5x3 matrix, and Theta2 is a 4x6 matrix. You set thetaVec = [Theta1(:); Theta2(:)]. Which of the following correctly recovers Theta2?
    • 🗹 reshape(thetaVec(16 : 39), 4, 6)
    • ☐ reshape(thetaVec(15 : 38), 4, 6)
    • ☐ reshape(thetaVec(16 : 24), 4, 6)
    • ☐ reshape(thetaVec(15 : 39), 4, 6)
    • ☐ reshape(thetaVec(16 : 39), 6, 4)
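
    A quick Octave check of the unrolling: Theta1(:) contributes elements 1 through 15, so Theta2 occupies elements 16 through 39.

    Theta1 = reshape(1:15, 5, 3);       % 5x3, 15 elements
    Theta2 = reshape(101:124, 4, 6);    % 4x6, 24 elements
    thetaVec = [Theta1(:); Theta2(:)];  % 39x1 unrolled parameter vector
    isequal(reshape(thetaVec(16:39), 4, 6), Theta2)   % -> 1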
  3. Let $J(\theta) = 2\theta^3 + 2$. Let $\theta = 1$, and $\epsilon = 0.01$. Use the formula $\frac{J{(\theta + \epsilon)}-J{(\theta - \epsilon)}}{2\epsilon}$ to numerically compute an approximation to the derivative at $\theta = 1$. What value do you get? (When $\theta = 1$, the true/exact derivative is $\frac{\mathrm{d} J(\theta)}{\mathrm{d} \theta} = 6$.)
    • ☐ 8
    • 🗹 6.0002
    • ☐ 6
    • ☐ 5.9998
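
    The arithmetic, as an Octave one-liner:

    J = @(theta) 2 * theta .^ 3 + 2;
    ep = 0.01;
    (J(1 + ep) - J(1 - ep)) / (2 * ep)   % -> 6.0002 (exact derivative is 6)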
  4. Which of the following statements are true? Check all that apply.
    • 🗹 For computational efficiency, after we have performed gradient checking to verify that our backpropagation code is correct, we usually disable gradient checking before using backpropagation to train the network.
    • ☐ Computing the gradient of the cost function in a neural network has the same efficiency when we use backpropagation or when we numerically compute it using the method of gradient checking.
    • 🗹 Using gradient checking can help verify if one’s implementation of backpropagation is bug-free.
    • ☐ Gradient checking is useful if we are using one of the advanced optimization methods (such as in fminunc) as our optimization algorithm. However, it serves little purpose if we are using gradient descent.
  5. Which of the following statements are true? Check all that apply.
    • 🗹 If we are training a neural network using gradient descent, one reasonable “debugging” step to make sure it is working is to plot $J(\theta)$ as a function of the number of iterations, and make sure it is decreasing (or at least non-increasing) after each iteration.
    • ☐ Suppose you have a three layer network with parameters $\theta^{(1)}$ (controlling the function mapping from the inputs to the hidden units) and $\theta^{(2)}$ (controlling the mapping from the hidden units to the outputs). If we set all the elements of $\theta^{(1)}$ to be 0, and all the elements of $\theta^{(2)}$ to be 1, then this suffices for symmetry breaking, since the neurons are no longer all computing the same function of the input.
    • 🗹 Suppose you are training a neural network using gradient descent. Depending on your random initialization, your algorithm may converge to different local optima (i.e., if you run the algorithm twice with different random initializations, gradient descent may converge to two different solutions).
    • ☐ If we initialize all the parameters of a neural network to ones instead of zeros, this will suffice for the purpose of “symmetry breaking” because the parameters are no longer symmetrically equal to zero.

Week 6

Advice for Applying Machine Learning :

  1. You train a learning algorithm, and find that it has unacceptably high error on the test set. You plot the learning curve, and obtain the figure below. Is the algorithm suffering from high bias, high variance, or neither?
    • ☐ High variance
    • ☐ Neither
    • 🗹 High bias
  2. You train a learning algorithm, and find that it has unacceptably high error on the test set. You plot the learning curve, and obtain the figure below. Is the algorithm suffering from high bias, high variance, or neither?
    • 🗹 High variance
    • ☐ Neither
    • ☐ High bias
  3. Suppose you have implemented regularized logistic regression to classify what object is in an image (i.e., to do object recognition). However, when you test your hypothesis on a new set of images, you find that it makes unacceptably large errors with its predictions on the new images. However, your hypothesis performs well (has low error) on the training set. Which of the following are promising steps to take? Check all that apply.
    NOTE: Since the hypothesis performs well (has low error) on the training set, it is suffering from high variance (overfitting)
    • ☐ Try adding polynomial features.
    • ☐ Use fewer training examples.
    • 🗹 Try using a smaller set of features.
    • 🗹 Get more training examples.
    • ☐ Try evaluating the hypothesis on a cross validation set rather than the test set.
    • ☐ Try decreasing the regularization parameter λ.
    • 🗹 Try increasing the regularization parameter λ.
  4. Suppose you have implemented regularized logistic regression to predict what items customers will purchase on a web shopping site. However, when you test your hypothesis on a new set of customers, you find that it makes unacceptably large errors in its predictions. Furthermore, the hypothesis performs poorly on the training set. Which of the following might be promising steps to take? Check all that apply.
    NOTE: Since the hypothesis performs poorly on the training set, it is suffering from high bias (underfitting)
    • ☐ Try increasing the regularization parameter λ.
    • 🗹 Try decreasing the regularization parameter λ.
    • ☐ Try evaluating the hypothesis on a cross validation set rather than the test set.
    • ☐ Use fewer training examples.
    • 🗹 Try adding polynomial features.
    • ☐ Try using a smaller set of features.
    • 🗹 Try to obtain and use additional features.
  5. Which of the following statements are true? Check all that apply.
    • ☐ Suppose you are training a regularized linear regression model. The recommended way to choose what value of regularization parameter $\lambda$ to use is to choose the value of $\lambda$ which gives the lowest test set error.
    • ☐ Suppose you are training a regularized linear regression model. The recommended way to choose what value of regularization parameter $\lambda$ to use is to choose the value of $\lambda$ which gives the lowest training set error.
    • 🗹 The performance of a learning algorithm on the training set will typically be better than its performance on the test set.
    • 🗹 Suppose you are training a regularized linear regression model. The recommended way to choose what value of regularization parameter $\lambda$ to use is to choose the value of $\lambda$ which gives the lowest cross validation error.
    • 🗹 A typical split of a dataset into training, validation and test sets might be 60% training set, 20% validation set, and 20% test set.
    • ☐ Suppose you are training a logistic regression classifier using polynomial features and want to select what degree polynomial (denoted $d$ in the lecture videos) to use. After training the classifier on the entire training set, you decide to use a subset of the training examples as a validation set. This will work just as well as having a validation set that is separate (disjoint) from the training set.
    • ☐ It is okay to use data from the test set to choose the regularization parameter λ, but not the model parameters (θ).
    • 🗹 Suppose you are using linear regression to predict housing prices, and your dataset comes sorted in order of increasing sizes of houses. It is then important to randomly shuffle the dataset before splitting it into training, validation and test sets, so that we don’t have all the smallest houses going into the training set, and all the largest houses going into the test set (see the sketch after this question).
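
    A minimal sketch of the checked shuffle-then-split procedure; the toy data is an assumption, and the 60/20/20 sizes follow the statements above:

    m = 100;
    X = sort(rand(m, 1));  y = 2 * X;   % toy dataset sorted by feature size
    idx = randperm(m);                  % random shuffle of example indices
    tr = idx(1:60);  cv = idx(61:80);  te = idx(81:100);
    Xtrain = X(tr, :);  Xval = X(cv, :);  Xtest = X(te, :);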
  6. Which of the following statements are true? Check all that apply.
    • 🗹 A model with more parameters is more prone to overfitting and typically has higher variance.
    • ☐ If the training and test errors are about the same, adding more features will not help improve the results.
    • 🗹 If a learning algorithm is suffering from high bias, only adding more training examples may not improve the test error significantly.
    • 🗹 If a learning algorithm is suffering from high variance, adding more training examples is likely to improve the test error.
    • 🗹 When debugging learning algorithms, it is useful to plot a learning curve to understand if there is a high bias or high variance problem.
    • ☐ If a neural network has much lower training error than test error, then adding more layers will help bring the test error down because we can fit the test set better.

11 - MK Machine Learning

MK Machine Learning

  • Code: TKE194918
  • Credits (SKS): 3
  • Schedule
    • TKE194918 Machine Learning A WEDNESDAY 15:00 - 17:30 ENGINEERING BUILDING E 201 - 12 students

Reference Sources

Tools

Lectures

Week 7-9

Week 10-12

Week 13

Visualizing Backpropagation

Week 14: Deep Learning Introduction

13 - Free Online Course

Free Online Course

Online Course Platform

List of Free Online Course

Hacking Satellite Course

Machine Learning Course

Programming

Deep Learning Course

Time Series Course

Machine Learning Course

Linear Algebra

Electrical Circuit Course Notes

Machine Learning Course Notes

Course

Course

Machine Learning

Course

HTML Learn

Course

Course

Programming Course

Machine Learning Course

Machine Learning

Machine Learning Course

Machine Learning Course

Open Course

NLP Course

Course

Course

Course

Distributed Systems

Course

Fast.ai Online Course

Course

Computer Science

Full Stack Deep Learning

Course Self Taught

Course

Introduction to Reinforcement Learning with David Silver (deepmind.com)

14 - Machine Learning by Andrew Ng Resources

Machine Learning by Andrew Ng Resources

Main Course

More Machine Learning Courses

Supplementary Notes

Supplementary Codes

Week 1:

Week 2:

Week 3:

Week 4:

Week 5:

Week 6:

Week 7:

Week 8:

Week 9:

Week 10:

Week 11:

Extra Information

Machine Learning Online E Books

Machine Learning Tutorial

Machine Learning Youtube

15 - Instrumentation

Instrumentation

16 - Digital Signal Processing

Digital Signal Processing

Signal Processing Jupyter Notebooks

Tools

Filter Design Tools

Tutorial

Audio Programming

DSP Notes

DSP Tools

DSP Books

DSP Lectures

DSP Interactive

Software Defined Radio

Music Retrieval Course

Speech Recognition

  • Libre ASR: An On-Premises, Streaming Speech Recognition System

Signal Processing Notes

Signal Processing

Free Books on Signal Processing

DSP: THEORY

  1. The Scientist and Engineer’s Guide to Digital Signal Processing - Steven W. Smith
  2. Introduction to Signal Processing - Sophocles J. Orfanidis
  3. Astronomical Image and Data Analysis - JL Starck and F Murtagh
  4. The Theory of Linear Prediction - P. P. Vaidyanathan
  5. Introduction to Statistical Signal Processing - R.M. Gray
  6. Mixed Signal and DSP Design Techniques - edited by Walt Kester
  7. Modern Signal Processing - edited by Daniel N. Rockmore and Dennis M. Healy
  8. Advances in Signal Transforms: Theory and Applications - edited by J. Astola and L. Yaroslavsky
  9. Advances in Nonlinear Signal and Image Processing - edited by Stephen Marshall and Giovanni L. Sicuranza
  10. The Data Conversion Handbook - Walt Kester
  11. Mathematics of the Discrete Fourier Transform (DFT) - Julius O. Smith III
  12. Principles of Sigma-Delta Modulation for A/D Converters - Sangil Park
  13. Using the ADSP-2100 Family Vol. 1 & Vol. 2 - Analog Devices Inc.
  14. A Technical Tutorial on Digital Signal Synthesis - Analog Devices Inc.

DSP: COMMUNICATIONS

  1. Signal Processing for Communications - Paolo Prandoni and Martin Vetterli
  2. Signals, Samples and Stuff: A DSP Tutorial: Part 1, Part 2, Part 3, Part 4 - Doug Smith
  3. FAQs on Digital Signal Processing
  4. Wireless Communications: Signal Processing Perspectives - Poor and Wornell
  5. Signal Processing with Fractals: A Wavelet-Based Approach - G. W. Wornell
  6. Stochastic Processes, Detection and Estimation - A. S. Willsky and G. W. Wornell

DSP: IMAGE PROCESSING

  1. Fundamentals of Image Processing - Young, Gerbrands and Vliet
  2. Advances in Nonlinear Signal and Image Processing - edited by Stephen Marshall and Giovanni L. Sicuranza
  3. Image Processing and Data Analysis: The Multiscale Approach - JL Starck, F Murtagh and A Bijaoui
  4. Principles of Computerized Tomographic Imaging - Kak and Slaney
  5. IMAGE ESTIMATION BY EXAMPLE: Geophysical Soundings Image Construction - Jon Claerbout and Sergey Fomel
  6. BASIC EARTH IMAGING - Jon Claerbout
  7. EARTH SOUNDINGS ANALYSIS: Processing versus Inversion - Jon Claerbout
  8. IMAGING THE EARTH’S INTERIOR - Jon Claerbout
  9. FUNDAMENTALS OF GEOPHYSICAL DATA PROCESSING - Jon Claerbout
  10. Genetic and Evolutionary Computation for Image Processing and Analysis - Stefano Cagnoni, Evelyne Lutton, and Gustavo Olague
  11. Image Processing in C: Analyzing and Enhancing Digital Images - Dwayne Phillips

DSP: AUDIO

  1. Introduction to Sound Processing - Davide Rocchesso
  2. Introduction to Digital Filters, with Audio Applications - Julius Smith
  3. Mathematics of the Discrete Fourier Transform (DFT), with Audio Applications - Julius Smith
  4. Physical Audio Signal Processing for Virtual Musical Instruments and Audio Effects - Julius Smith
  5. High-Fidelity Multichannel Audio Coding - Dai Tracy Yang, Chris Kyriakakis, and C.-C. Jay Kuo
  6. Physical Audio Signal Processing - Julius O. Smith III
  7. Spectral Audio Signal Processing - Julius O. Smith III

DSP: SPECTRAL ANALYSIS

  1. Bayesian Spectrum Analysis and Parameter Estimation - G. Larry Bretthorst
  2. Chebyshev and Fourier Spectral Methods - John Boyd
  3. The Temporal and Spectral Characteristics of Ultrawideband Signals - William Kissick

DSP: MISCELLANEOUS TOPICS

  1. Biomedical Digital Signal Processing - Willis J. Tompkins
  2. Stochastic Optimal Control: The Discrete-Time Case - Bertsekas
  3. Signal Processing with Fractals: A Wavelet-Based Approach - Gregory Wornell
  4. Nonlinear Systems Theory: The Volterra/Wiener Approach - Wilson Rugh
  5. Detection of Abrupt Changes - Theory and Application - Basseville and Nikiforov
  6. An Introduction to Signal Processing in Chemical Analysis - T. O’Haver
  7. Multimedia Fingerprinting Forensics for Traitor Tracing - K. J. Ray Liu, Wade Trappe, Z. Jane Wang, Min Wu, and Hong Zhao
  8. Genomic Signal Processing and Statistics - edited by Dougherty, Shmulevich, Chen, and Wang

DSP: IMPLEMENTATION

  1. Computer Aids for VLSI Design - Steven Rubin
  2. Application-Specific Integrated Circuits - Michael Smith
  3. The VHDL Cookbook - Peter Ashenden
  4. Controlling Noise and Radiation in Mixed-Signal and Digital Systems - Nicholas Gray

Free Books on Signal Processing II

  1. Introduction to Digital Signal Processing - Paolo Prandoni
  2. Efficient Digital Filters - Matthew Donadio
  3. Discrete-Time Signal Processing - MIT
  4. Modern Signal Processing - edited by Daniel N. Rockmore and Dennis M. Healy, Jr.
  5. Signals and Systems - MIT

Signal Processing

17 - MK Sistem Kendali Lanjut

MK Sistem Kendali Lanjut

  • Code: TKE193154
  • Credits (SKS): 3
  • Schedule 2020
    • TKE193154 Sistem Kendali Lanjut A FRIDAY 13:20 - 15:50 ENGINEERING BUILDING E 204 - 15 students

References

  • Norman S. Nise, Control Systems Engineering [website]
  • Katsuhiko Ogata, Modern Control Engineering
  • Richard C. Dorf and Robert H. Bishop, Modern Control Systems [website]
  • Farid Golnaraghi and Benjamin C. Kuo, Automatic Control Systems [website]
  • Brian Douglas, The Fundamentals of Control Theory [website][ebook]
  • Pao C. Chau, Process Control: A First Course With MATLAB [website]
  • Karl J. Åström and Richard M. Murray, Feedback Systems: An Introduction for Scientists and Engineers [website]
  • R.V. Dukkipati, Analysis and Design of Control Systems using MATLAB
  • Ricone Website

Software

Online Course

Online Video Course

Lectures

Week-1

Week-2

equ = [1 2 3]  % characteristic polynomial coefficients: s^2 + 2s + 3
roots(equ)     % its roots, i.e. the system poles

Week-3

pkg load control
num = [1]            % numerator coefficients
den = [1 2 3]        % denominator coefficients
sys = tf(num, den)   % transfer function 1/(s^2 + 2s + 3)
rlocus(sys)          % root locus plot

Week-4

Assignments

  • Preparation
  • Jupyter Notebook exercise in ICCT
    • You will interact with Jupyter Notebooks in ICCT.
    • Click the ICCT folder in Jupyter Notebook, then click Table-of-Contents-ICCT.ipynb.
    • Right-click and open in a new tab the file Link 1.1.1 Complex Numbers in Cartesian Form in the folder 1.1 Complex Numbers.
    • You are now in the Jupyter Notebook M-01_Complex_numbers_Cartesian_form.ipynb.
    • Select the menu, then … (the menu names are missing from the source).
    • Please read the notebook, including its explanations and assignments.
    • Then change the value of the complex number and press … .
    • Then vary the operation, such as …, etc.
    • You can download or screenshot the image.
    • Select the menu, then … to shut down the Jupyter Notebook.
  • Assignment (two weeks)
    • Following the distribution (attached on Eldiru), do the following:
    • Run the Jupyter Notebook files as distributed to you.
    • For each Jupyter Notebook file, write a mini report as a .docx or .odt file consisting of:
      • The title, with an explanation (translated into Indonesian) of the Jupyter Notebook file. (The Python code in the notebook need not be included.)
      • A discussion: a brief account of the activities you performed, supplemented with downloaded images (screenshots) where needed.
    • Save each file under the name NIM-TugasXXX.docx, e.g. H1A018091-Tugas385.odt. Combine the three assignment files into one .zip file, then upload it to the Assignment page on Eldiru.

Control Systems Terminology

  • Bandwidth and 3 dB. The bandwidth of a band-pass filter is the frequency range that is allowed to pass through with minimal attenuation. The frequency at which the signal’s power level has decreased by 3 dB from its maximum value is called the 3 dB bandwidth. A 3 dB decrease in power means the signal power becomes half of its maximum value. This occurs when the output voltage has dropped to $1/{\sqrt{2}}$ (~0.707) of the maximum output voltage and the power has dropped by half (since $P = V^2/R$). Exact: $20\log _{10}\left({\tfrac {1}{\sqrt {2}}}\right)\approx -3.0103\ \mathrm {dB}$
  • Half-power point - Wikipedia
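
The -3 dB figure, checked numerically (Octave):

20 * log10(1 / sqrt(2))   % -> -3.0103 dB (voltage ratio at the half-power point)
10 * log10(1 / 2)         % -> -3.0103 dB (the same point, as a power ratio)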

18 - Machine Learning CS229

Machine Learning CS229

Deep Learning Specialization by Andrew Ng

Machine Learning Course

Practical Deep Learning

Visualizing Backpropagation

Machine Learning Course

19 - MK Dasar Teknik Elektro

MK Dasar Teknik Elektro

  • Code: TKE191113
  • Credits (SKS): 2
  • Schedule
    • TKE191121 Dasar Teknik Elektro B WEDNESDAY 10:20 - 12:00 ENGINEERING BUILDING E 104 - 46 students
    • TKE191121 Dasar Teknik Elektro A WEDNESDAY 12:30 - 14:10 ENGINEERING BUILDING E 101 - 44 students

Program Learning Outcomes (CPL)

  • Knowledge-PU03: mastery of engineering knowledge and computational science for analyzing and designing complex electrical and electronic devices, software, and systems composed of hardware and software components;
  • Knowledge-PU04: mastery of the core knowledge of electrical engineering, covering electric circuits, signals and systems, digital systems, electromagnetics, and electronics, together with their applications;
  • Specific Skills-KK02: the ability to apply knowledge of mathematics, basic science, and engineering topics in the field of electrical engineering;

Course Learning Outcomes (CPMK)

  • Understand basic mathematics and science, and engineering topics in the field of electrical engineering;
  • Understand the scope of engineering fundamentals and computational science needed to analyze and design
    • electrical devices,
    • electronic devices,
    • software, and
    • systems (software and hardware);
  • Understand the scope of the core knowledge of electrical engineering, covering electric circuits, signals and systems, digital systems, electromagnetics, and electronics, together with their applications;

Study Topics

  • Overview of basic mathematics and science for electrical engineering
  • Overview of engineering knowledge for electrical engineering
  • Overview of current engineering topics in electrical engineering
  • Overview of computational science for electrical engineering
  • Overview of analysis and design methods for electrical and electronic devices
  • Overview of analysis and design methods for software
  • Introduction to electric circuits and their applications in electrical engineering
  • Introduction to signals and systems and their applications in electrical engineering
  • Introduction to digital systems and their applications in electrical engineering
  • Introduction to electronics and its applications in electrical engineering

References

Free and Open References

Paid References

Online Course References

Lectures

Week-1

Week-2

  • Topic:
    • Overview of Engineering Fundamentals for Electrical Engineering
  • Assignment:
    • Translation: Electrical Engineering: Know It All by Clive Maxfield et al.

Week-3

  • Topic:
    • Introduction to Signals and Systems

20 - Lectures

Lectures

202020212

  • TKE192221 Sistem Kendali A [FAR; IMR; ],[2019]; C 201 Tuesday 07.55, 2
  • TKE192221 Sistem Kendali B [FAR; IMR; ],[2019]; C 201 Tuesday 10.40, 2
  • TKE192227 Pengolahan Sinyal Digital B [AZS; IMR; ],[2019]; E 101 Wednesday 07.00, 3
  • TKE192227 Pengolahan Sinyal Digital A [AZS; IMR; ],[2019]; E 101 Wednesday 09.45, 3
  • TKE194945 Internet of Things A [AZS; IMR; ],[2018]; E 205 Thursday 13.00, 3
  • TKE194941 Sistem Kendali Cerdas A [IMR; AGU; ],[2018]; E 201 Friday 13.55, 3

202020211

  • TKE194917 Sistem Adaptif A TUESDAY 09:30 - 12:00 ENGINEERING BUILDING E 202
  • TKE191121 Dasar Teknik Elektro B WEDNESDAY 10:20 - 12:00 ENGINEERING BUILDING E 104
  • TKE191121 Dasar Teknik Elektro A WEDNESDAY 12:30 - 14:10 ENGINEERING BUILDING E 101
  • TKE194918 Machine Learning A WEDNESDAY 15:00 - 17:30 ENGINEERING BUILDING E 201 - 12 students
  • TKE191113 Matematika Teknik A THURSDAY 07:00 - 09:30 ENGINEERING BUILDING C 101
  • TKE191113 Matematika Teknik B THURSDAY 09:30 - 12:00 ENGINEERING BUILDING C 101
  • TKE194021 Proyek Keteknikan A FRIDAY 07:50 - 09:30 ENGINEERING BUILDING C 103 - 1 student
  • TKE193154 Sistem Kendali Lanjut A FRIDAY 13:20 - 15:50 ENGINEERING BUILDING E 204

201920202

  • TKE192227 Pengolahan Sinyal Digital
  • TKE192221 Sistem Kendali
  • TKE194941 Sistem Kendali Cerdas

201920201

  • TKE191121 Dasar Teknik Elektro
  • TKE193153 Sistem Kendali Digital

201820192

  • TKE132207 Pengolahan Sinyal Digital
  • TKE134103 Proyek Keteknikan
  • TKE132201 Sistem Kontrol

201820191

  • TKE131104 Dasar Teknik Elektro
  • TKE134026 Jaringan Sensor
  • TKE134103 Proyek Keteknikan
  • TKE134033 Sistem Adaptif

201720182

  • TKE133201 Instrumentasi
  • TKE131201 Metode Transformasi
  • TKE132207 Pengolahan Sinyal Digital
  • TKE132201 Sistem Kontrol

201720181

  • TKE131104 Dasar Teknik Elektro
  • TKE134026 Jaringan Sensor
  • TKE132102 Matematika Teknik
  • TKE134103 Proyek Keteknikan
  • TKE134033 Sistem Adaptif

21 - Linear Algebra

Linear Algebra

Software

PC

Android

MOOC

Youtube

List of Books

Proprietary Books

Free Books

Open Books

Book Recommendation

Video Recommendation

Table of Contents some of the Open Books

Interactive Linear Algebra by Dan Margalit

  1. Systems of Linear Equations: Algebra (pp 1-27)
  2. Systems of Linear Equations: Geometry (pp 29-112)
  3. Linear Transformations and Matrix Algebra (pp 113-185)
  4. Determinants (pp 187-235)
  5. Eigenvalues and Eigenvectors (pp 237-337)
  6. Orthogonality (pp 339-407)

Discover Linear Algebra by Jeremy Sylvestre

  1. Systems of Equations and Matrices (pp 7-169)
    • Systems of linear equations
    • Solving systems using matrices
    • Using systems of equations
    • Matrices and matrix operations
    • Matrix inverses
    • Elementary matrices
    • Special forms of square matrices
    • Determinants
    • Determinants versus row operations
    • Determinants, the adjoint, and inverses
  2. Vector Spaces (pp 170-374)
    • Introduction to vectors
    • Geometry of vectors
    • Orthogonal vectors
    • Geometry of linear systems
    • Abstract vector spaces
    • Subspaces
    • Linear independence
    • Basis and Coordinates
    • Dimension
    • Column, row, and null spaces
  3. Introduction to Matrix Forms (pp 375-413)
    • Eigenvalues and eigenvectors
    • Diagonalization

Linear Algebra by Jim Hefferon

  1. Linear Systems
    • Solving Linear Systems
    • Linear Geometry
    • Reduced Echelon Form
  2. Vector Spaces
    • Definition of Vector Space
    • Linear Independence
    • Basis and Dimension
  3. Maps Between Spaces
    • Isomorphisms
    • Homomorphisms
    • Computing Linear Maps
    • Matrix Operations
    • Change of Basis
    • Projection
  4. Determinants
    • Definition
    • Geometry of Determinants
    • Laplace’s Formula
  5. Similarity
    • Complex Vector Spaces
    • Similarity
    • Nilpotence
    • Jordan Form

A First Course in Linear Algebra by Robert A. Beezer

  1. Systems of Linear Equations
  2. Vectors
  3. Matrices
  4. Vector Spaces
  5. Determinants
  6. Eigenvalues
  7. Linear Transformations
  8. Representations

Linear Algebra, Theory And Applications by Kenneth Kuttler

  1. Preliminaries
  2. Matrices and Linear Transformations
  3. Determinants
  4. Row Operations
  5. Some Factorizations
  6. Linear Programming
  7. Spectral Theory
  8. Vector Spaces and Fields
  9. Linear Transformations
  10. Linear Transformations Canonical Forms
  11. Markov Chains and Migration Processes
  12. Inner Product Spaces
  13. Self Adjoint Operators
  14. Norms for Finite Dimensional Vector Spaces
  15. Numerical Methods for Finding Eigenvalues

MATH 1220 Linear Algebra 1 by Michael Doob

  1. Systems of Linear Equations
  2. Matrix theory
  3. The Determinant
  4. Vectors in Euclidean n-space
  5. Eigenvalues and eigenvectors
  6. Linear transformations

Markov Chains

22 - MK Matematika Teknik

MK Matematika Teknik

  • Code: TKE191113
  • Credits (SKS): 3
  • Schedule:
    • TKE191113 Matematika Teknik A THURSDAY 07:00 - 09:30 ENGINEERING BUILDING C 101 - 65 students
    • TKE191113 Matematika Teknik B THURSDAY 09:30 - 12:00 ENGINEERING BUILDING C 101 - 44 students

Program Learning Outcomes (CPL)

  • Knowledge-PU01: mastery of advanced mathematics covering integral and differential calculus, differential equations, linear algebra, complex variables, probability and statistics, and discrete mathematics, together with their applications in electrical engineering;

Course Learning Outcomes

  • Master analytical and numerical methods for solving linear equations
  • Master matrix operations and their applications
  • Master the concepts of eigenvalues and eigenvectors and their applications
  • Master the concepts of vectors and vector spaces and their applications
  • Master the concept of linear transformations and its applications

Study Topics

  • Systems of Linear Equations
  • Matrices and Matrix Operations
  • Eigenvalues and Eigenvectors
  • LU Decomposition
  • Diagonalization and Quadratic Forms (supplementary)
  • Euclidean Vector Spaces
  • General Vector Spaces
  • Linear Transformations
  • Applications of Linear Algebra in Electrical Engineering

References

Paid References

Free and Open References

Online Lectures

Youtube

Software

PC

Android

Lectures

Week-1

Week-2

  • Topic:
    • Matrices: Operations

Week-3

  • Topic:
    • Matrices: Inverse and LU Decomposition
  • Notes:
    • Why LU decomposition? Because a single LU factorization can be reused to solve linear systems with many different right-hand-side vectors b, which is faster than Gaussian elimination, which must repeat its elimination steps for every new vector b (Ref-1, Ref-2). A minimal sketch follows below.
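
    A minimal Octave sketch of that point: factor once, then solve for several right-hand sides with only forward/back substitution (the matrix and vectors are illustrative):

    A = [4 3; 6 3];
    [L, U, P] = lu(A);        % factor once: P*A = L*U
    b1 = [10; 12];  b2 = [1; 0];
    x1 = U \ (L \ (P * b1));  % solve A*x1 = b1 by substitution only
    x2 = U \ (L \ (P * b2));  % reuse the same L and U for a new b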

23 - Control Design with Frequency Method

Control Design with Frequency Method

Comparison of the Root Locus (RL) and Frequency Response (FR) Methods

  • For transient-response and stability design via gain adjustment
    • FR is easier; the gain can be read directly off the Bode plot
  • For transient-response design with cascade compensation
    • FR is not as intuitive as RL
    • in RL, particular points are known to have particular transient-response characteristics
    • in FR:
      • phase margin is related to percent overshoot
      • bandwidth is related to damping ratio, settling time, and peak time
  • For steady-state-error design with cascade compensation
    • in FR, compensation can be designed that improves the transient response and the steady-state error at the same time.
    • in RL, there are many possible compensator solutions (and each solution raises its own steady-state-error issues).

Frequency Response Design

  • A system that is stable in open loop will be stable in closed loop if the open-loop frequency-response magnitude has a gain of less than 0 dB at the frequency where the phase is 180 degrees (see the sketch after this list)
  • Percent overshoot is reduced by increasing the phase margin
  • The response is sped up by increasing the bandwidth
  • Steady-state error is improved by increasing the magnitude of the low-frequency response
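
A minimal Octave sketch of reading these quantities off the open-loop response, using the control package; the transfer function is an illustrative assumption:

pkg load control
sys = tf([1], [1 2 3 0]);         % illustrative open-loop transfer function
[gm, pm, wg, wp] = margin(sys);   % gain margin, phase margin, crossover frequencies
printf('PM = %.1f deg at %.2f rad/s\n', pm, wp);
bode(sys)                         % inspect low-frequency gain and bandwidth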

Improving Transient Response via Gain Adjustment

  • Damping ratio ($\zeta$) (and percent overshoot) is tied to the phase margin (PM); a sketch of the standard relation follows below
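
A minimal sketch of that quantitative link, using the standard second-order relation between phase margin and damping ratio (as given in Nise's Control Systems Engineering):

zeta = 0.5;   % damping ratio
PM = atand(2 * zeta / sqrt(-2 * zeta^2 + sqrt(1 + 4 * zeta^4)))
% -> about 51.8 degrees of phase margin for zeta = 0.5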

Bode Diagram for Gain Adjustment

24 - Control Systems

Control Systems

Reference

  • Norman S. Nise, Control Systems Engineering [website]
  • Katsuhiko Ogata, Modern Control Engineering
  • Richard C. Dorf and Robert H. Bishop, Modern Control Systems [website]
  • Farid Golnaraghi and Benjamin C. Kuo, Automatic Control Systems [website]
  • Brian Douglas, The Fundamentals of Control Theory [website][ebook]
  • Pao C. Chau, Process Control: A First Course With MATLAB [website]
  • Karl J. Åström and Richard M. Murray, Feedback Systems: An Introduction for Scientists and Engineers [website]
  • R.V. Dukkipati, Analysis and Design of Control Systems using MATLAB

Online Book

Interactive Learning

Specific Topics

Control Theory Map

Control Theory Map

Software

Interactive Control Systems Learning

Online Video Course

Control Learning Videos

Control Theory Interactive

ICCT

Python Control

Intelligent Control

Control Systems Online Curriculum

Level 1:

  1. Math basics:

    1. Algebra 1: https://www.khanacademy.org/math/algebra and https://www.khanacademy.org/math/algebra2

    2. Trig: https://www.khanacademy.org/math/trigonometry

    3. Basic Calculus: https://www.khanacademy.org/math/ap-calculus-ab and https://www.khanacademy.org/math/ap-calculus-bc

  2. Physics Basics:

    1. General Physics: https://www.khanacademy.org/science/physics

    2. More “advanced” general physics: https://www.khanacademy.org/science/ap-physics-1 and https://www.khanacademy.org/science/ap-physics-2

  3. MATLAB Basics:

    1. https://ocw.mit.edu/courses/mathematics/18-s997-introduction-to-matlab-programming-fall-2011/index.htm

Level 2:

  1. Intermediate Math:

    1. Linear Algebra: https://www.khanacademy.org/math/linear-algebra

    2. Differential Equations: https://www.khanacademy.org/math/differential-equations

  2. Intermediate Physics:

    1. Calculus based Mechanics at the college level: https://ocw.mit.edu/courses/physics/8-012-physics-i-classical-mechanics-fall-2008/index.htm

    2. E&M: https://ocw.mit.edu/courses/physics/8-02-physics-ii-electricity-and-magnetism-spring-2007/index.htm

    3. Waves and vibrations: https://ocw.mit.edu/courses/physics/8-03-physics-iii-spring-2003/index.htm

  3. Intro to Simulink: https://ctms.engin.umich.edu/CTMS/index.php?example=Introduction&section=SimulinkModeling

Level 3:

  1. More rigorous math courses:

    1. Multivariable Calculus: https://www.khanacademy.org/math/multivariable-calculus

    2. Higher level linear algebra: https://ocw.mit.edu/courses/mathematics/18-06-linear-algebra-spring-2010/index.htm

    3. Higher level differential equations: https://ocw.mit.edu/courses/mathematics/18-03-differential-equations-spring-2010/

  2. More rigorous physics:

    1. Mechanics 2: https://ocw.mit.edu/courses/physics/8-223-classical-mechanics-ii-january-iap-2017/

    2. Mechanics 3: https://ocw.mit.edu/courses/physics/8-09-classical-mechanics-iii-fall-2014/

  3. Beginning Engineering:

    1. Electrical:

      1. Circuits: https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-002-circuits-and-electronics-spring-2007/

      2. Signals: https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-003-signals-and-systems-fall-2011/

    2. Mechanical:

      1. Beginning dynamics: https://ocw.mit.edu/courses/mechanical-engineering/2-003sc-engineering-dynamics-fall-2011/syllabus/

      2. More Dynamics and intro to control: https://ocw.mit.edu/courses/mechanical-engineering/2-003j-dynamics-and-control-i-spring-2007/index.htm

Level 4:

  1. Helpful Math:

    1. Beginning Stats: https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-041sc-probabilistic-systems-analysis-and-applied-probability-fall-2013/
  2. Signal Processing:

    1. Signals and systems: https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-011-signals-systems-and-inference-spring-2018/syllabus/
  3. Control:

    1. Dynamics and control 2: https://ocw.mit.edu/courses/mechanical-engineering/2-004-dynamics-and-control-ii-spring-2008/index.htm

    2. More systems and control: https://ocw.mit.edu/courses/mechanical-engineering/2-04a-systems-and-controls-spring-2013/index.htm

    3. Feedback Control: https://ocw.mit.edu/courses/aeronautics-and-astronautics/16-30-feedback-control-systems-fall-2010/index.htm

    4. More intro control: https://www.edx.org/course/introduction-control-system-design-first-mitx-6-302-0x?utm_source=OCW&utm_medium=CHP&utm_campaign=OCW

    5. More state space intro: https://www.edx.org/course/introduction-state-space-control-mitx-6-302-1x?utm_source=OCW&utm_medium=CHP&utm_campaign=OCW

    6. Recommended resources for this level, to supplement and help with the courses above; these will also help with some of the “higher” level material:

      1. katkimshow Intro to control: https://www.youtube.com/playlist?list=PLmK1EnKxphikZ4mmCz2NccSnHZb7v1wV-

      2. Brian Douglas Control System Lectures: https://www.youtube.com/playlist?list=PLUMWjy5jgHK3j74Z5Tq6Tso1fSfVWZC8L

      3. Steve Brunton Control Bootcamp: https://www.youtube.com/playlist?list=PLMrJAkhIeNNR20Mz-VpzgfQs5zrYi085m

Level 5:

  1. Optional Math:

    1. Complex Variable: https://ocw.mit.edu/courses/mathematics/18-04-complex-variables-with-applications-fall-2003/

    2. A course designed to help intuition: https://ocw.mit.edu/courses/mathematics/18-098-street-fighting-mathematics-january-iap-2008/index.htm

  2. More rigorous practice in signals and systems:

    1. Graduate signals processing: https://ocw.mit.edu/courses/mechanical-engineering/2-161-signal-processing-continuous-and-discrete-fall-2008/lecture-notes/
  3. Control:

    1. Higher level dynamics and control: https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-241j-dynamic-systems-and-control-spring-2011/index.htm

    2. Higher level feedback control: https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-302-feedback-systems-spring-2007/calendar/

    3. Slightly higher level control: https://ocw.mit.edu/courses/mechanical-engineering/2-14-analysis-and-design-of-feedback-control-systems-spring-2014/index.htm

    4. Multi-variable control systems: https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-245-multivariable-control-systems-spring-2004/index.htm

Level 6:

  1. Optional Nonlinear Dynamics:

    1. Chaos: https://ocw.mit.edu/courses/mathematics/18-353j-nonlinear-dynamics-i-chaos-fall-2012/index.htm

    2. Continuum: https://ocw.mit.edu/courses/mathematics/18-354j-nonlinear-dynamics-ii-continuum-systems-spring-2015/

  2. Non-Linear control:

    1. More theory based: https://web.mit.edu/nsl/www/videos/lectures.html

    2. More practice based: https://www.youtube.com/watch?v=9xDZy5mE-3I&list=PLrxYXaxBXgRoqgaBlitaAA_sgVZ8V6Teg (note: videos in English except the introduction)

      1. Resources for these videos: https://sites.google.com/a/g2.nctu.edu.tw/nonlinear-control-systems-2017-fall/course-materials

Level 7:

  1. More advanced, but optional, non-linear dynamics:

    1. Chaos: https://ocw.mit.edu/courses/mathematics/18-385j-nonlinear-dynamics-and-chaos-fall-2014/index.htm

    2. Waves: https://ocw.mit.edu/courses/mechanical-engineering/2-034j-nonlinear-dynamics-and-waves-spring-2007/index.htm

  2. Control:

    1. Sliding mode: https://www.youtube.com/watch?v=x9WxwM6Ebvo (Note: this is the only video or online material I could find in course form on sliding mode; please suggest more if you find any)

    2. Optimal and Robust control: https://www.youtube.com/watch?v=z64cXTZKw4I&list=PLMLojHoA_QPmRiPotD_TnfdUkglTexuqm

Control eBook

26 - Course

Course

Electrical Engineering Lectures at Unsoed

  • [[kuliah|Lectures]]
  • [[mk-dasar-teknik-elektro|MK Dasar Teknik Elektro]]
  • [[mk-internet-of-things|MK Internet of Things]]
  • [[mk-machine-learning|MK Machine Learning]]
  • [[mk-matematika-teknik|MK Matematika Teknik]]
  • [[mk-pengolahan-sinyal-digital|MK Pengolahan Sinyal Digital]]
  • [[mk-sistem-kendali|MK Sistem Kendali]]
  • [[mk-sistem-kendali-cerdas|MK Sistem Kendali Cerdas]]
  • [[mk-sistem-kendali-lanjut|MK Sistem Kendali Lanjut]]

Online Courses

Course Resources

27 - Computer Science

Computer Science

Computer Science

Computer Science