Part 1:

  1. Read the paper's abstract. Then record in your reading log what you think the main problem the paper is solving is, and what you think the paper's main contribution is. This should be in your own words.
    1. The problem the paper is trying to solve is that modern natural language processing models are still limited in their ability to understand language the way humans do.
    2. The paper’s main contribution is a new model, which the authors claim is conceptually simple yet powerful, and which they claim can be adapted to a wide range of tasks without requiring heavy task-specific modification of the model itself.
  2. Now, read the introduction, and highlight or underline any words, terms or concepts that you do not understand, but don't stop to try to figure them out, yet. When you have finished reading the introduction, answer the following in your reading log:
    1. Give more detail about the specific problem(s) that this paper is addressing. While your answer above might have been very general, you should try to give more detail here.
      1. The specific problem this paper addresses is that the two current strategies for applying pre-trained language models, feature-based and fine-tuning, are limited in accuracy because they rely on unidirectional language models to learn general language representations. This limitation is especially pronounced when a unidirectional model is used for token-level tasks.
    2. What terms or concepts are you confused about, and how important do you think each will be to your understanding of the paper?
      1. The terms I am confused about are feature-based, fine-tuning, pre-training, natural language inference, token-level, unidirectional language models, self-attention layers, Transformer, masked language model, and pre-training objective. For most of these I can get an abstract sense of what they mean from context, but I think understanding the design behind them will be vital to learning how the BERT model is designed.
  3. Next, skip to the "Experiments" or "Results" section and read it, again continuing to highlight (but not worry too much about) things you don't understand. Then, in your log record the answer to this question: "How does the author show that their approach is successful? What evidence does the author provide? Is it compelling?"
    1. One way the authors show that their approach is successful is by comparing the BERT model to other models on the GLUE benchmark. The results are displayed in a table showing that BERT scored higher on every task in the benchmark. They also evaluated their model on the Stanford Question Answering Dataset (SQuAD), which tests the model’s ability to answer questions about a passage it has read. Again, BERT outperformed all other models, with an even larger margin after additional fine-tuning with the TriviaQA dataset. Finally, they evaluate BERT on SWAG and show that it outperforms the other models there as well.
  4. Now, read the "Background" or "Related Work" section. After you are done, answer this question in your log: "How does the work in this paper build on what has been done before?" Your answer doesn't have to be perfect. Just give it your best shot. You'll answer this question again on your next pass.
    1. BERT builds on past conventions in natural language processing. The related work covers both unsupervised feature-based approaches and unsupervised fine-tuning approaches for adapting pre-trained representations to different tasks. In this section the authors describe earlier models that pioneered these ideas, such as ELMo, which produced context-sensitive features by combining two unidirectional models, and OpenAI GPT, which used pre-training followed by fine-tuning to obtain better results. BERT improves on ELMo’s shallow pairing of two unidirectional models by being deeply bidirectional, and it adapts OpenAI GPT’s pre-training method so that fewer parameters need to be learned from scratch.
  5. Finally, read the "meat" of the paper which describes the new approach or technique that the paper proposes. Continue to highlight what you don't understand, but also highlight areas that you feel are important. Use a different annotation for each, so you can distinguish between them. When you are finished, answer these questions in your reading log:
    1. To the best of your understanding, what is proposed in this paper? The answer to this question should contain as much detail as possible, and should be a few paragraphs long, most likely.
      1. The paper proposes BERT, a state-of-the-art natural language processing model. In the BERT section, the authors describe its design in more detail.
        1. The opening of this section gives a broad overview of the architecture. Most notably, the framework has two steps: pre-training and fine-tuning. In pre-training, the model is trained on unlabeled data across several pre-training tasks; in fine-tuning, the pre-trained model is further trained on the labeled data of its specific task. BERT’s architecture is a multi-layer bidirectional Transformer encoder. The authors also describe the input/output representation the model uses: the input can be either a single sentence or a pair of sentences packed together into one sequence, which makes the model flexible across the tasks it is adapted for. Within a packed sequence, the two sentences can still be told apart because they are separated by a special [SEP] token and marked with learned segment embeddings.
        2. The pre-training section explains how the pre-training process is carried out for BERT.
          1. They explain that unidirectional models cannot simply be made bidirectional, since conditioning on both directions would let each word indirectly “see itself,” making the prediction trivial. To train a deep bidirectional representation, the authors instead mask a percentage of the input tokens at random and train the model to predict those masked tokens; they mask 15% of the tokens. However, because the [MASK] token never appears during fine-tuning and could create a mismatch, the selected tokens are only replaced with [MASK] 80% of the time; 10% of the time they are replaced with a random token, and 10% of the time they are left unchanged (a small sketch of this procedure appears after this list).
          2. To capture relationships between sentences, BERT also uses a binarized next sentence prediction task: for each training pair, the second sentence is the actual next sentence 50% of the time and a randomly chosen sentence the other 50% (also illustrated in the sketch below). This trains the model to recognize a logical flow between sentences, which is important for understanding language beyond a single sentence.
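To make the masked-language-model and next-sentence-prediction procedures above concrete, here is a minimal Python sketch of how such training examples could be constructed. This is my own illustration, not code from the paper: the tiny vocabulary, the function names, and the omission of WordPiece tokenization and segment embeddings are all simplifying assumptions.

```python
import random

# Toy vocabulary and special tokens (hypothetical; the real BERT uses a
# ~30,000-token WordPiece vocabulary).
VOCAB = ["the", "cat", "sat", "on", "mat", "dog", "ran", "fast"]
CLS, SEP, MASK = "[CLS]", "[SEP]", "[MASK]"

def mask_tokens(tokens, mask_prob=0.15):
    """Apply the 15% selection and 80/10/10 replacement rule of the masked LM."""
    inputs, labels = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            labels.append(tok)                       # model must predict the original token
            r = random.random()
            if r < 0.8:
                inputs.append(MASK)                  # 80%: replace with [MASK]
            elif r < 0.9:
                inputs.append(random.choice(VOCAB))  # 10%: replace with a random token
            else:
                inputs.append(tok)                   # 10%: leave the token unchanged
        else:
            inputs.append(tok)
            labels.append(None)                      # unselected tokens are not predicted
    return inputs, labels

def make_nsp_example(sent_a, actual_next, corpus):
    """Build a next-sentence-prediction pair: 50% real next sentence, 50% random."""
    if random.random() < 0.5:
        sent_b, is_next = actual_next, True
    else:
        sent_b, is_next = random.choice(corpus), False
    # Pack both sentences into a single sequence with [CLS] at the start and
    # [SEP] separating them, as in the input representation described above.
    return [CLS] + sent_a + [SEP] + sent_b + [SEP], is_next
```

In the real model, the masked positions and the is_next label are what the two pre-training losses are computed against.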
    2. What are the key concepts/terms/ideas that are blocking your more complete understanding of this paper? These should be the things you highlighted as confusions that you feel are most important for you to understand in order to understand this paper.
      1. The main concept blocking my full understanding of the paper is the function and architecture of the multi-layer bidirectional Transformer encoder. The paper does not explain Transformers because they are a standard component by now, but since I am unfamiliar with them I should look into the topic in order to put the paper in context.
  6. Finally, read quickly or skim any sections that you have not yet read, though at this point these sections should consist only perhaps of the discussion and the conclusion. There is nothing to record for this part.

Part 2.1:

Part 2.2:

  1. What is your takeaway message from this paper?
    1. My key takeaway message from this paper is that the BERT architecture can be a powerful and more efficient tool for creating new NLP models for different downstream purposes, such as for our research project.
  2. What is the motivation for this work (both people and technical problem), and its distillation into a research question? Why doesn't the problem have a trivial solution? What are the previous solutions and why are they inadequate?
    1. The motivation for this work is to train machines to better understand and generate human language, in order to automate more tasks in the future. The technical problem is to create a more accurate architecture that accomplishes this while still being trainable with a realistic amount of time and data.
    2. The problem doesn’t have a trivial solution because training machines to learn language is a very complex task. Through many layers of abstraction, scientists have to translate data into mathematical form and then design algorithms that train the machine on different datasets. Human language tasks are incredibly complex in themselves, so although computers have enormous brute computational power, the ability to adapt to that complexity is difficult to implement.
    3. The previous solutions were natural language processing models that were either only shallowly bidirectional or strictly unidirectional. Because they cannot deeply condition on context from both directions at once, they lack the ability to use context from both sides of a word, which is a crucial part of inference in human language.
  3. What is the proposed solution? Why is it believed it will work? How does it represent an improvement? How is the solution achieved?
    1. The proposed solution is the BERT architecture, which uses a deep bidirectional Transformer so that models are trained with context from both directions. Intuitively it makes sense why this would work: more context means more information for making accurate predictions. BERT achieves this by building on the Transformer to essentially read the entire sequence of words at once rather than iteratively, which makes it bidirectional, since each prediction it makes during training can use the rest of the sequence as context (a small sketch of this contrast follows below).
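To illustrate the contrast between reading a sequence iteratively from one side and attending to the whole sequence at once, here is a small, hypothetical sketch of the attention visibility involved; the function and its exact form are my own simplification, not the paper’s code.

```python
import numpy as np

def attention_visibility(seq_len, bidirectional=True):
    """Return a matrix where entry (i, j) = 1 if position i may attend to position j."""
    if bidirectional:
        # BERT-style encoder: every token attends to the whole sequence at once,
        # so left and right context are both available in a single parallel pass.
        return np.ones((seq_len, seq_len), dtype=int)
    # Left-to-right language model: token i only sees positions 0..i (causal mask).
    return np.tril(np.ones((seq_len, seq_len), dtype=int))

print(attention_visibility(4, bidirectional=False))  # lower-triangular: left context only
print(attention_visibility(4, bidirectional=True))   # all ones: full context, no iteration
```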
  4. What is the author's evaluation of the solution? What logic, argument, evidence, artifacts (e.g., a proof-of-concept system), or experiments are presented in support of the idea?
    1. The authors’ experiments section gives convincing statistics for two BERT configurations on several well-respected benchmark datasets, namely GLUE, SQuAD, and SWAG. On all of these benchmarks, the BERT models achieved higher accuracy than all of the other contemporary models.
    2. The authors’ ablation studies section also serves as a proof of concept for the BERT design.
      1. When parts of the pre-training objective are removed (for example the next sentence prediction task, or bidirectionality itself), the authors show that it is still possible to train a model, but accuracy on the downstream tasks drops noticeably.
      2. By training two models, BERT_BASE and BERT_LARGE, the authors show that accuracy keeps improving as the model gets larger. The benefit of larger models on large-scale tasks had been shown before, but these experiments provide evidence that continued increases in model size lead to better performance not only on large tasks but also on small-scale tasks, provided the model has been sufficiently pre-trained (the two configurations are summarized in the sketch below).
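For reference, the two configurations compared in the paper differ as follows; writing them out as a small Python dictionary is just my own summary of the numbers the paper reports.

```python
# Model configurations reported in the BERT paper.
BERT_CONFIGS = {
    "BERT_BASE":  {"layers": 12, "hidden_size": 768,  "attention_heads": 12, "parameters": "~110M"},
    "BERT_LARGE": {"layers": 24, "hidden_size": 1024, "attention_heads": 16, "parameters": "~340M"},
}
```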
  5. What is your analysis of the identified problem, idea and evaluation? Is this a good idea? What flaws do you perceive in the work? What are the most interesting or controversial ideas? For work that has practical implications, ask whether this will work, who would want it, what it will take to give it to them, and when it might become a reality?
    1. My analysis of the problem is that teaching machines natural language processing is a potentially very useful capability for future automation. However, these architectures are still far from being reliable for non-trivial tasks. The BERT model is a large leap in accuracy, performance, and flexibility for an NLP model. Still, I believe the problem of bias will arise, especially as datasets expand to include more niche language involving marginalized communities. This worry may be eased in the future, since the authors explain that continuing to scale up the model and its training improves accuracy even on small-scale tasks. Undeniably, though, BERT takes a large step in the direction of this future.
    2. For simpler downstream tasks, I think BERT provides a great increase in value, especially for commercial uses.
  6. What are the paper's contributions (author's and your opinion)? Ideas, methods, software, experimental results, experimental techniques...?
    1. The authors’ main contribution is the BERT model itself, which rests on their introduction of deep bidirectional pre-training with a Transformer encoder.
  7. What are the future directions for this research (author's and ours, perhaps driven by shortcomings or other critiques)?
    1. I believe that the most important future direction for this research is to continue figuring out how to better account for biases. The success of this architecture may also transfer to related fields such as computer vision.
  8. What questions are you left with? What questions would you like to raise in an open discussion of the work (review interesting and controversial points above)? What do you find difficult to understand? List as many as you can.
    1. The concept I still find difficult to understand is the implication of the bidirectional Transformer encoder. Though I now have an abstract idea of its inner workings, it is still difficult for me to wrap my head around how the Transformer achieves bidirectionality without iterating over the sequence.