Goal:
- Build a text classification model to predict an article's category using the AG News dataset
- There are 4 categories, labeled 0-3
Rough idea of steps:
- Import dataset from huggingface, etc.
- Tokenize datasets
- Create a bag of words for each article
- Potentially store the bag of words in a tensor, or a hash function + vector, or dictionary
- Perform TF-IDF (term frequency - inverse document frequency) weighting
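The steps above can be sketched end to end in pure Python. This is a minimal sketch, assuming a tiny toy corpus in place of the real AG News articles; tokenization here is just a lowercase whitespace split, and a real pipeline would load the dataset from Hugging Face and use a proper tokenizer.

```python
import math
from collections import Counter

# Toy corpus standing in for AG News articles (assumption: the real
# dataset would be loaded from Hugging Face instead).
articles = [
    "stocks rally as markets open higher",
    "team wins championship after overtime thriller",
    "new phone launch boosts tech stocks",
]

# Tokenize: lowercase whitespace split.
tokenized = [a.lower().split() for a in articles]

# Bag of words: term frequency dictionary per article.
bags = [Counter(tokens) for tokens in tokenized]

# Document frequency: number of articles containing each term.
n_docs = len(bags)
df = Counter()
for bag in bags:
    df.update(bag.keys())

def tfidf(bag):
    """TF-IDF weights for one article's bag of words."""
    total = sum(bag.values())
    return {
        term: (count / total) * math.log(n_docs / df[term])
        for term, count in bag.items()
    }

weights = [tfidf(bag) for bag in bags]
```

The resulting per-article weight dictionaries could later be packed into a tensor (one column per vocabulary term) for the classifier.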
Bag of words:
- Imagines each article as a “bag of words”
- Stores the frequency of terms from an article in a dictionary
- However, this does not capture word order or any context between words
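The bag-of-words idea above maps directly onto a frequency dictionary; `collections.Counter` (a dict subclass) is a natural fit:

```python
from collections import Counter

# Treat the article as an unordered multiset of tokens and store
# term -> frequency in a dictionary.
article = "the cat sat on the mat"
bag = Counter(article.lower().split())
# bag == {"the": 2, "cat": 1, "sat": 1, "on": 1, "mat": 1}
```

Note that "the cat sat on the mat" and "the mat sat on the cat" produce the same bag, which is exactly the word-order information this representation throws away.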