- Why use new evaluation metrics?
- Can measuring success with a different metric improve scores on certain models?
- Previous methods
- Rule-based
- Supervised metric
- Are all errors created equal?
- Errors have different severity levels depending on their type
- e.g. “cat person” or “people person”
- Error hierarchy
- MQM Human Annotations
- Assign each error in a sentence a score based on its severity, then sum the scores to get the final score
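
The severity-weighted sum described in the last bullet can be sketched as follows. This is only a minimal illustration: the severity labels, the weight values, and the name `mqm_score` are assumptions made for the example, not values taken from these notes, and real MQM setups define their own weights.

```python
# Hypothetical severity weights (assumed for illustration only;
# actual MQM annotation schemes define their own values).
SEVERITY_WEIGHTS = {"minor": 1, "major": 5, "critical": 10}


def mqm_score(error_severities):
    """Sum the severity weights of all annotated errors in one sentence."""
    return sum(SEVERITY_WEIGHTS[severity] for severity in error_severities)


# One sentence annotated with two minor errors and one major error.
annotations = ["minor", "major", "minor"]
print(mqm_score(annotations))  # 7
```

Under these assumed weights a higher total means a worse sentence, and a sentence with no annotated errors scores 0.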