Goal:
- Build a text classification model to predict an article's category using the AG News dataset
- There are 4 categories, labeled 0-3
Rough idea of steps:
- Import dataset from huggingface, etc.
- Tokenize datasets
- Create a bag of words for each article
- Potentially store the bag of words in a tensor, or a hash function + vector, or dictionary
- Perform TF-IDF (term frequency - inverse document frequency) weighting
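The steps above can be sketched end to end in pure Python. This is a minimal sketch, assuming a tiny toy corpus in place of the real AG News articles; tokenization here is just a lowercase whitespace split, and a real pipeline would load the dataset from Hugging Face and use a proper tokenizer.

```python
import math
from collections import Counter

# Toy corpus standing in for AG News articles (assumption: the real
# dataset would be loaded from Hugging Face instead).
articles = [
    "stocks rally as markets open higher",
    "team wins championship after overtime thriller",
    "new phone launch boosts tech stocks",
]

# Tokenize: lowercase whitespace split.
tokenized = [a.lower().split() for a in articles]

# Bag of words: term frequency dictionary per article.
bags = [Counter(tokens) for tokens in tokenized]

# Document frequency: number of articles containing each term.
n_docs = len(bags)
df = Counter()
for bag in bags:
    df.update(bag.keys())

def tfidf(bag):
    """TF-IDF weights for one article's bag of words."""
    total = sum(bag.values())
    return {
        term: (count / total) * math.log(n_docs / df[term])
        for term, count in bag.items()
    }

weights = [tfidf(bag) for bag in bags]
```

The resulting per-article weight dictionaries could later be packed into a tensor (one column per vocabulary term) for the classifier.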
Bag of words:
- Imagines each article as a “bag of words”
- Stores the frequency of terms from an article in a dictionary
- However, this does not capture word order or any context between words
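The bag-of-words idea above maps directly onto a frequency dictionary; `collections.Counter` (a dict subclass) is a natural fit:

```python
from collections import Counter

# Treat the article as an unordered multiset of tokens and store
# term -> frequency in a dictionary.
article = "the cat sat on the mat"
bag = Counter(article.lower().split())
# bag == {"the": 2, "cat": 1, "sat": 1, "on": 1, "mat": 1}
```

Note that "the cat sat on the mat" and "the mat sat on the cat" produce the same bag, which is exactly the word-order information this representation throws away.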