# introduction to deep learning

- VISTA: synthesizing environments for autonomous vehicles to train in
- Don’t have to send the vehicles out into the real world to train; can do so through simulation!

- Artificial intelligence (AI): techniques that allow computers to mimic human behavior
- Machine learning (ML): train a machine to make decisions based on a set of data, but not explicitly programming it to make decisions
- Deep learning (DL): using neural networks to automatically extract patterns from raw data
- Machine extracts core patterns, then applies them to new data
- This is as opposed to ML, where humans hand-pick and define the features that are fed into the machine to learn from

**Perceptron**: a single neuron

- Composed of inputs, weights, a bias, a nonlinear activation function, and a summation
- Steps to get the output of a perceptron ($\hat{y}$), with a sketch after the nonlinearity notes below:
- multiply inputs by weights
- sum and add the bias
- apply the nonlinearity: $\hat{y} = g\left(w_0 + \sum_{i=1}^{m} x_i w_i\right)$

- What is a **nonlinear function**? Why is it useful?
- What: a function that takes any real number and maps it to a specific range
- Example: the sigmoid function maps all numbers to be between [0, 1] using the function $g(z) = \frac{1}{1 + e^{-z}}$

- Why: introduce nonlinearity into the network

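A minimal sketch of the perceptron forward pass above, using the sigmoid as the activation (the inputs, weights, and bias here are made-up values):

```python
import numpy as np

def sigmoid(z):
    # maps any real number into (0, 1)
    return 1 / (1 + np.exp(-z))

def perceptron(x, w, w0):
    # multiply inputs by weights, sum, add the bias, apply the nonlinearity
    z = np.dot(x, w) + w0
    return sigmoid(z)

x = np.array([1.0, 2.0])     # inputs (made up)
w = np.array([0.5, -0.3])    # weights (made up)
w0 = 0.1                     # bias (made up)
print(perceptron(x, w, w0))  # a value in (0, 1)
```
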
- Deep neural networks are just neural networks with many hidden layers
**Loss functions** tell the neural network how big of a mistake it made, given the predicted value and the true value

- Loss optimization involves minimizing the loss value; we want to find the NN weights that achieve this
- Gradient descent:
- For any point, we can compute the gradient of the loss function for that point, and tweak the value such that the loss value decreases
- Repeat this until convergence to the minimum value

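A sketch of gradient descent on a made-up one-parameter loss, $L(w) = (w - 3)^2$, with the gradient computed analytically:

```python
def loss(w):
    return (w - 3) ** 2

def grad(w):
    # dL/dw for this toy loss
    return 2 * (w - 3)

w = 0.0   # starting point
lr = 0.1  # learning rate
for _ in range(100):
    w -= lr * grad(w)  # step against the gradient, decreasing the loss

print(w, loss(w))  # w converges to the minimum at w = 3
```
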
**Backpropagation:** computing the gradient of the loss with respect to each weight by applying the chain rule from the output back to the input (a tiny worked sketch follows the learning-rate notes below)

- Training NNs in practice is difficult
- What is the learning rate? How do we set it?
- Small LRs converge slowly and occasionally get stuck in local minima
- Large LRs overshoot and diverge, so the NN doesn't train
- The learning rate can also be set by an adaptive algorithm that responds to the loss landscape (the space of possible weights)
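
As promised above, a tiny backpropagation sketch: one input, one sigmoid hidden unit, one linear output, squared-error loss. All values are made up; the point is the chain rule flowing from the loss back to each weight:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# made-up data and weights
x, y_true = 0.5, 1.0
w1, w2 = 0.8, -0.4

# forward pass
h = sigmoid(w1 * x)        # hidden activation
y_hat = w2 * h             # prediction
L = (y_hat - y_true) ** 2  # squared-error loss

# backward pass: chain rule from output back to input
dL_dyhat = 2 * (y_hat - y_true)
dL_dw2 = dL_dyhat * h             # dL/dw2 = dL/dyhat * dyhat/dw2
dL_dh = dL_dyhat * w2             # propagate back through the output weight
dL_dw1 = dL_dh * h * (1 - h) * x  # sigmoid'(z) = h * (1 - h)

print(dL_dw1, dL_dw2)
```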

- What are batches? Why are they useful?
- It’s not feasible to compute the gradient over the entire dataset because the dataset is too large
- Take a small “batch” (sample) of the dataset and compute the gradient over that instead of the entire set or a single point
- Using mini-batches allows for smoother convergence and faster training (allows for parallelization per batch!!)
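
A sketch of mini-batch gradient descent on a made-up regression problem (all sizes and values are illustrative assumptions):

```python
import numpy as np

# made-up dataset: fit y = w * x with squared-error loss
rng = np.random.default_rng(0)
X = rng.normal(size=1000)
Y = 3.0 * X + rng.normal(scale=0.1, size=1000)

w, lr, batch_size = 0.0, 0.01, 32
for _ in range(500):
    idx = rng.integers(0, len(X), size=batch_size)  # sample a mini-batch
    xb, yb = X[idx], Y[idx]
    grad = np.mean(2 * (w * xb - yb) * xb)  # gradient over the batch only
    w -= lr * grad

print(w)  # close to the true value 3.0
```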

- What is overfitting? How do we correct for it?
- A model that has overfit to a dataset has followed the training data too closely and can't generalize to new data
- Regularization is introduced into the NN to help discourage overfitting
- Dropout: randomly set some neurons' activations to 0 during training
- Early stopping: stop training before the model starts to overfit (e.g., when validation loss begins to rise)
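
A minimal sketch of dropout applied to one layer's activations during training; the drop probability is a made-up choice, and the rescaling is the common "inverted dropout" convention:

```python
import numpy as np

def dropout(activations, p_drop=0.5):
    # randomly zero out a fraction p_drop of neurons; scale the rest
    # so the expected activation stays the same ("inverted dropout")
    mask = np.random.random(activations.shape) >= p_drop
    return activations * mask / (1 - p_drop)

h = np.array([0.2, 1.5, -0.7, 0.9])
print(dropout(h))  # roughly half the entries zeroed at p_drop = 0.5
```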

- Summary:
- What is the perceptron? What are its parts? What is a nonlinear activation function and why is it useful?
- How do we get from a single perceptron to a NN? How does the NN learn? What is backpropagation and what is its relation to weight calculation?
- What are some techniques applied in practice that allow for NNs to be accurate?

# recurrent neural networks, transformers, and attention

- Sequential data is data whose points depend on other points, for example, the sound waves defining audio, or a time series such as stock market prices
- For working with sequential data, we define a recurrence relation: a state $h_t$ for time step $t$ which retains information about the state the NN was in when it produced the output $\hat{y}_t$
- Since the state of the NN is tracked for each output, the output now depends on the state: $\hat{y}_t = f(x_t, h_{t-1})$

- Formally, **recurrent neural networks (RNNs)** track a state $h_t$ which is updated at each time step
- Given an input vector, update the hidden state (which has its own weight matrix)
- Then combine the input's weight matrix and the hidden state's weight matrix with a nonlinearity to get the output
- Loss is calculated for each individual timestep and then combined into an overall loss value
- Backpropagation occurs through each individual timestep, then from the current time step all the way back to the beginning
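
A sketch of a single RNN step under the standard formulation $h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t)$, $\hat{y}_t = W_{hy} h_t$ (the dimensions and random weights are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim = 3, 4, 2

# one weight matrix for the input, one for the hidden state, one for the output
W_xh = rng.normal(size=(hidden_dim, input_dim))
W_hh = rng.normal(size=(hidden_dim, hidden_dim))
W_hy = rng.normal(size=(output_dim, hidden_dim))

def rnn_step(x_t, h_prev):
    # combine the input and the previous state, then apply the nonlinearity
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t)
    y_t = W_hy @ h_t  # output depends on the updated state
    return y_t, h_t

# the same weights are reused at every time step
h = np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):  # a made-up 5-step sequence
    y, h = rnn_step(x_t, h)
print(y)
```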

- Issues with backpropagating in RNNs
- Computing the gradient wrt the initial input requires that you perform gradient calculations on many versions of the state weight matrix
- Exploding gradients → fix with gradient clipping
- Gradients keep increasing and get extremely large
- Scale back large gradients by clipping
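
A sketch of norm-based gradient clipping (the threshold is a made-up choice):

```python
import numpy as np

def clip_gradient(grad, max_norm=5.0):
    # scale the gradient back if its norm exceeds the threshold
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = np.array([30.0, -40.0])  # an "exploding" gradient with norm 50
print(clip_gradient(g))      # rescaled to norm 5, same direction
```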

- Vanishing gradients
- Harder time capturing long term dependencies because many small numbers are being multiplied together

- Fixes for vanishing gradients:
- Activation function tweaking
- Parameter initialization via setting weights to identity matrices
- Gated cells: using gates to control what information flows through each recurrent unit (e.g., LSTM)

- What should we keep in mind when designing models for sequences?
- Variable-length sequences
- Dependencies between data points that are distant from each other
- Maintaining order
- One set of weights can be applied to any timestep input and still work

- Encoding language for NNs: transform words into vector representations
- One-hot embedding
- Obtain a vocabulary (corpus)
- Map each word to an index
- A word is a vector of 0s with a 1 at the word’s index

- Cons to one-hot embedding: the encoding captures no relationship or similarity between words
- Learned embedding: use a NN to learn an embedding

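A sketch of one-hot embedding over a tiny made-up vocabulary:

```python
import numpy as np

# made-up corpus mapped to indices
vocab = {"deep": 0, "learning": 1, "is": 2, "fun": 3}

def one_hot(word, vocab):
    # a vector of 0s with a 1 at the word's index
    v = np.zeros(len(vocab))
    v[vocab[word]] = 1.0
    return v

print(one_hot("learning", vocab))  # [0. 1. 0. 0.]
# note: any two different words give orthogonal vectors, so the
# encoding says nothing about how related the words are
```
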
- Limitations of RNNs in application
- Encoding bottlenecks: in the case of many-to-one models (sentiment classification), how do we encode all the text to just one single result without losing information?
- Slow: can’t parallelize because every step depends on the previous one
- No long term memory

**Attention:** how can we eliminate the need for recurrence and improve on the above issues, but still analyze data in sequential order?

- Identify the most important features in the input
- Encode position information (embedding)
- Extract query, key, value for search — what is the most important information related to my request?
- Compute attention weighting: compute the pairwise similarity between each query and key
- How similar are any two features?
- Computed using the dot product between query and key, a similarity metric (when normalized, this is the cosine similarity)

- Extract features with high attention

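A minimal sketch of dot-product self-attention following the steps above; the scaling by $\sqrt{d}$ is the standard transformer detail, and all shapes and weights are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 5, 8                  # made-up sequence length and feature size
X = rng.normal(size=(seq_len, d))  # inputs (position encoding assumed already added)

# learned projections that extract query, key, and value
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v

# pairwise similarity between each query and key (scaled dot product)
scores = Q @ K.T / np.sqrt(d)

# softmax turns scores into attention weights that sum to 1 per query
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# extract the features weighted by attention
output = weights @ V
print(output.shape)  # (5, 8)
```
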
- Summary:
- What are RNNs? What are their key features? What are they useful for?
- How do RNNs perform backpropagation and what are some major issues of backpropagating in RNNs? What are some downsides of RNNs?
- What is attention? How does it relate to RNNs and their limitations? Why is it important?