Topic: Transformer architecture
https://medium.com/inside-machine-learning/what-is-a-transformer-d07dd1fbec04
https://www.youtube.com/watch?v=TQQlZhbC5ps
https://www.youtube.com/watch?v=xI0HHN5XKDo&t=199s
Best sources:
https://medium.com/dissecting-bert/dissecting-bert-part-1-d3c3d495cdb3
https://jalammar.github.io/illustrated-transformer/
Part 1:
- What is the function and architecture behind the multilayer bidirectional Transformer encoder?
- Things to understand first
- A sequence-to-sequence architecture is a type of neural network that maps an input sequence to an output sequence
- Sequence-to-sequence models are good at tasks like translation, i.e., the act of transforming one sequence into another
- The sequence-to-sequence architecture takes a sequence of elements (such as characters or words) and changes it into another sequence
- The sequence-to-sequence architecture also involves an encoder and a decoder
- The encoder transforms the input sequence into an n-dimensional vector representation
- The decoder transforms that n-dimensional representation back into an output of the desired type (see the shape sketch below)
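- A minimal NumPy sketch of the shapes involved; the dimensions and the stand-in encode/decode functions below are invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8                      # toy sizes, for illustration only

x = rng.standard_normal((seq_len, d_model))  # stand-in for an embedded input sequence

def encode(x):
    # Stand-in encoder: any function mapping the sequence to n-dimensional vectors.
    return np.tanh(x @ rng.standard_normal((d_model, d_model)))

def decode(memory):
    # Stand-in decoder: consumes the encoder's output and produces the output sequence.
    return memory @ rng.standard_normal((d_model, d_model))

memory = encode(x)        # (5, 8): the n-dimensional representation
output = decode(memory)   # (5, 8): transformed toward the desired output
```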
- Attention
- The attention mechanism looks at the input sequence at each step of the sequence-to-sequence transformation and determines which parts of the sequence are more or less important
- The attention mechanism lets the encoder signal which words are most important to the decoder, which helps the decoder do its job
- Embeddings
- The model does not actually recognize words; it splits them into tokens, each of which is assigned an adjustable (learnable) embedding
- The embedding is a vector representing the token’s location in an “embedding space”
- Essentially, words that are similar are clustered together in this embedding space (see the lookup sketch below)
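- A minimal sketch of a token-embedding lookup; the tiny vocabulary and dimensions are invented for illustration (in a real model the table is learned during training):

```python
import numpy as np

vocab = {"the": 0, "cat": 1, "sat": 2}       # hypothetical toy vocabulary
vocab_size, d_model = len(vocab), 4

rng = np.random.default_rng(0)
embedding_table = rng.standard_normal((vocab_size, d_model))  # learnable in practice

token_ids = [vocab[w] for w in ["the", "cat", "sat"]]
embeddings = embedding_table[token_ids]      # shape (3, 4): one vector per token
```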
- Positional Encoder
- This is a vector encoding of the position of the word in a sentence, since word order matters
- This is added to the embedding of a token, producing an embedding that also carries positional context (sketched below)
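- A sketch of the sinusoidal positional encoding used in the original Transformer paper (BERT instead learns its position embeddings, but the idea of adding position information to the token embedding is the same):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding from "Attention Is All You Need"."""
    pos = np.arange(seq_len)[:, None]                        # (seq_len, 1)
    i = np.arange(d_model)[None, :]                          # (1, d_model)
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                    # even dimensions
    pe[:, 1::2] = np.cos(angles[:, 1::2])                    # odd dimensions
    return pe

# The encoding is simply added to the token embeddings:
# contextual_embeddings = embeddings + positional_encoding(seq_len, d_model)
```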
- The transformer
- The transformer is also an architecture that transforms sequence to sequence using encoders and decoders, but without using Recurrent Networks
- One huge advantage of the transformer architecture is that it is fast, because it can work in parallel
- What does that mean?
- Recurrent Neural Networks are slow to train and slow to translate because the input sequence must be passed in one element at a time, with the hidden state updated as each input is processed
- This does not map well onto current GPU architecture, because GPUs are designed to perform many operations in parallel
- The transformer fixes this by accepting all of the inputs in the sequence at once and generating the embeddings in parallel (see the contrast sketch below)
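- A toy contrast between the two styles of computation (shapes and weights invented for illustration): the RNN loop must run step by step, while the transformer-style computation handles every position in one matrix multiply, which is what GPUs are built for:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 5, 8
x = rng.standard_normal((seq_len, d))
W = rng.standard_normal((d, d))

# RNN-style: each step depends on the previous hidden state,
# so the positions cannot be processed in parallel.
h = np.zeros(d)
for t in range(seq_len):
    h = np.tanh(x[t] + h @ W)

# Transformer-style: one matrix multiply covers all positions at once.
all_positions = np.tanh(x @ W)
```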
- Transformer components
- Input embedding
- Positional encoding
- Encoder Block
- Multi-Head Attention Layer
- Determines what part of the input we should focus on
- In other words, for each word of the sequence, which other words are most relevant to it
- $Z = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$, where $d_k$ is the dimension of the query/key vectors (sketched in code below)
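- A minimal NumPy sketch of that formula, i.e., single-head scaled dot-product attention (toy shapes and random Q/K/V for illustration; in a real multi-head layer, Q, K, and V are linear projections of the input and there are several heads):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Z = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                     # (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # row-wise softmax
    return weights @ V                                  # weighted sum of value vectors

rng = np.random.default_rng(0)
Q = rng.standard_normal((5, 8))   # 5 tokens, d_k = 8
K = rng.standard_normal((5, 8))
V = rng.standard_normal((5, 8))
Z = scaled_dot_product_attention(Q, K, V)               # (5, 8)
```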
- Feed Forward Layer
- A simple feed-forward neural network applied independently to each attention vector, transforming it into a form the following encoder or decoder can use (see the sketch below)
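- A sketch of that position-wise feed-forward network; the ReLU form below matches the original paper, and the toy sizes are for illustration (the paper uses d_model = 512, d_ff = 2048):

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied independently to each position."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 5, 8, 32                 # toy sizes
W1, b1 = rng.standard_normal((d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d_model)), np.zeros(d_model)

attn_output = rng.standard_normal((seq_len, d_model))
out = feed_forward(attn_output, W1, b1, W2, b2)   # same shape as the input: (5, 8)
```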
- Decoder Block
- Outputs
- Output Embedding
- Positional Encoding
- Masked Multi-Head Attention Layer
- Multi-Head Attention Layer
- Feed Forward Layer
- Linear Layer
- Softmax Layer
- Converts the outputs into a probability distribution (over the vocabulary)
- After each sublayer, a residual connection and layer normalization are applied
- Both the encoder and decoder are made up of modules that can be stacked on top of each other
- The modules mainly consist of Multi-Head Attention and Feed Forward layers (see the encoder-block sketch below)
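- To tie the components together, here is a minimal single-head sketch of one encoder module (self-attention, then the feed-forward network, each followed by a residual connection and layer normalization); all weights are random stand-ins for what would normally be learned:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position's feature vector to mean 0, variance 1.
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def encoder_block(x, Wq, Wk, Wv, W1, W2):
    # Single-head self-attention (the real architecture uses several heads).
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V
    x = layer_norm(x + attn)                  # Add & Norm after attention
    ffn = np.maximum(0, x @ W1) @ W2          # position-wise feed-forward
    return layer_norm(x + ffn)                # Add & Norm after the FFN

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 5, 8, 32
x = rng.standard_normal((seq_len, d_model))
Wq, Wk, Wv = (rng.standard_normal((d_model, d_model)) for _ in range(3))
W1, W2 = rng.standard_normal((d_model, d_ff)), rng.standard_normal((d_ff, d_model))

# Modules stack: the output of one block is the input to the next.
for _ in range(2):
    x = encoder_block(x, Wq, Wk, Wv, W1, W2)
```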
- Attention is all you need
- BERT Architecture
Part 2:
- A clearly defined topic that you will teach. What do you want your "student" to be able to know and do at the end of the session?
- I want my students to have a slightly better understanding of the matrix manipulation that goes on behind BERT’s encoder architecture.
- I also hope that by learning the smaller details of the encoder’s architecture, they can understand the bigger picture of why BERT trains faster than recurrent models while being more accurate
- Explanations including visual aids (could be drawings on paper, or images printed or on your computer, or just the same material that helped you learn it, which you will explain more fully) that will help you teach the topic to your student
- At least one exercise or set of problems or review questions that your "student" will perform or answer at the end of your session. You will use this to test how well they understood what you taught them, but it is also part of the learning itself.
- See the section at the bottom denoted “Pop Quiz!”
Presentation Notes:
- How Does BERT’s Encoder Work?
- What does BERT’s encoder look like?
- So here is a picture of the transformer architecture, and here is BERT’s architecture
- As you can see, BERT’s main architecture simply utilizes the encoder part of the Transformer architecture, so that is what we are going to focus on today
- The encoder is responsible for determining the attention / relationships between the words in the input
- Since the encoder is a sequence-to-sequence architecture, at the end it passes on another sequence of vectors whose values have been adjusted by attention
- Therefore, since we are only interested right now in pre-training a model, we do not need a decoder for this process
- What are the parts of the encoder?
- As you can see in the picture, the encoder is made up of a multi-head attention layer, a normalization layer, a feed forward layer, and another normalization layer
- A brief overview:
- The multi-head attention layer
- The multi-head attention layer is primarily responsible for determining the self-attention weight values of the different words in the input
- The normalization layer
- The normalization layer normalizes the inputs across the features
- Two types of normalization can be used: batch normalization and layer normalization
- However, both involve taking the mean and standard deviation of either the columns (batch norm) or rows (layer norm) of the embedding matrices and normalizing accordingly (see the sketch below)
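- A minimal sketch of the two normalizations on a toy (seq_len × d_model) embedding matrix; the Transformer (and BERT) use layer normalization:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 8))   # toy embedding matrix: 5 positions, 8 features

# Layer norm: mean/std over each row (across the features of one position).
layer_normed = (x - x.mean(axis=1, keepdims=True)) / x.std(axis=1, keepdims=True)

# Batch norm: mean/std over each column (one feature across all positions/examples).
batch_normed = (x - x.mean(axis=0, keepdims=True)) / x.std(axis=0, keepdims=True)
```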
- The feed forward layer
- The normalization layer
- Embeddings