AI: bump vocab_size from 256 to 34816 using byte pair encoding
BPE
My byte-pair encoding (BPE) implementation works.
PR: https://github.com/sebhtml/novigrad/pull/17
I increased vocab_size from 256 to 34816 with BPE.
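For intuition, BPE starts from the 256 byte values and repeatedly merges the most frequent adjacent pair of tokens into a new token until the target vocabulary size is reached. Here is a minimal sketch of that idea in Rust (this is not the novigrad code; real implementations differ in how they count pairs and break ties):

use std::collections::HashMap;

// Minimal BPE training sketch: start from byte-level tokens (0..256) and
// repeatedly merge the most frequent adjacent pair into a new token until
// the vocabulary reaches the target size.
fn train_bpe(text: &str, target_vocab_size: u32) -> (Vec<u32>, HashMap<(u32, u32), u32>) {
    let mut tokens: Vec<u32> = text.bytes().map(|b| b as u32).collect();
    let mut merges: HashMap<(u32, u32), u32> = HashMap::new();
    let mut next_token: u32 = 256;

    while next_token < target_vocab_size {
        // Count adjacent token pairs.
        let mut counts: HashMap<(u32, u32), usize> = HashMap::new();
        for pair in tokens.windows(2) {
            *counts.entry((pair[0], pair[1])).or_insert(0) += 1;
        }
        // Pick the most frequent pair; stop when nothing repeats anymore.
        let Some((&best_pair, &count)) = counts.iter().max_by_key(|&(_, &c)| c) else {
            break;
        };
        if count < 2 {
            break;
        }
        merges.insert(best_pair, next_token);

        // Replace every occurrence of the pair with the new token.
        let mut merged = Vec::with_capacity(tokens.len());
        let mut i = 0;
        while i < tokens.len() {
            if i + 1 < tokens.len() && (tokens[i], tokens[i + 1]) == best_pair {
                merged.push(next_token);
                i += 2;
            } else {
                merged.push(tokens[i]);
                i += 1;
            }
        }
        tokens = merged;
        next_token += 1;
    }
    (tokens, merges)
}

fn main() {
    let (tokens, merges) = train_bpe("low lower lowest low low", 300);
    println!("{} tokens, {} merges", tokens.len(), merges.len());
}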
GPU
Thanks to NVIDIA, I can do this on my laptop, which has an NVIDIA RTX 4060.
Tokens
The text I am using has 78970 characters.
With my BPE implementation, this text is encoded into 44513 tokens, which is roughly 1.77 characters per token.
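Encoding then replays the learned merges on the raw bytes. Here is a sketch of that step under the same assumptions as above (the merges map, with lower IDs learned first, is an illustration, not the novigrad API):

use std::collections::HashMap;

// Encode text by replaying the learned merges, earliest-learned first
// (lower new-token IDs were created first). Sketch only.
fn encode(text: &str, merges: &HashMap<(u32, u32), u32>) -> Vec<u32> {
    let mut tokens: Vec<u32> = text.bytes().map(|b| b as u32).collect();
    loop {
        // Among all pairs that have a learned merge, apply the one learned first.
        let best = tokens
            .windows(2)
            .filter_map(|p| merges.get(&(p[0], p[1])).map(|&id| ((p[0], p[1]), id)))
            .min_by_key(|&(_, id)| id);
        let Some((pair, new_id)) = best else {
            break;
        };
        let mut merged = Vec::with_capacity(tokens.len());
        let mut i = 0;
        while i < tokens.len() {
            if i + 1 < tokens.len() && (tokens[i], tokens[i + 1]) == pair {
                merged.push(new_id);
                i += 2;
            } else {
                merged.push(tokens[i]);
                i += 1;
            }
        }
        tokens = merged;
    }
    tokens
}

fn main() {
    // Hypothetical learned merges, just to exercise the function.
    let mut merges = HashMap::new();
    merges.insert((b'l' as u32, b'o' as u32), 256); // "l" + "o" -> "lo"
    merges.insert((256, b'w' as u32), 257);         // "lo" + "w" -> "low"
    println!("{:?}", encode("low lower", &merges)); // [257, 32, 257, 101, 114]
}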
Loss
The loss goes from 108.2 to 0 during training:
Epoch 0 Total_error 108.20975, change: NaN
Epoch 100 Total_error 0, change: -1
Epoch 300 Total_error 0, change: NaN
Example 0
input_text: {{short description|Video game franchise}}
{{About|the vi
expected_output_text: de
input_tokens: [22796, 30981, 11071, 20834, 19622, 31405, 30073, 11, 23050, 12550, 13, 32014, 19623, 16841, 6070, 17, 125, 28674, 1460, 15833, 20052, 8, 22923, 20, 22797, 21, 24292, 31531, 13, 5289, 126, 24990]
expected_output_token: [19624]
Example 0 before training:
Loss 11.230275
actual_output_token: [13142]
actual_output_text: me
Example 0 after training:
Loss -0
actual_output_token: [19624]
actual_output_text: de
Weirdness
The weird thing is that the neural network can generate a token that is not known to the BPE decoder. When that happens, I chose to simply emit '?'.
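Here is a sketch of that fallback, assuming a vocab map from token ID to text (the names decode and vocab are illustrative, not the actual novigrad decoder):

use std::collections::HashMap;

// Decode token IDs back to text, emitting '?' for any ID that is not in
// the BPE vocabulary (for example an ID the network made up).
fn decode(tokens: &[u32], vocab: &HashMap<u32, String>) -> String {
    tokens
        .iter()
        .map(|id| vocab.get(id).map(String::as_str).unwrap_or("?"))
        .collect()
}

fn main() {
    let mut vocab = HashMap::new();
    vocab.insert(19624_u32, "de".to_string()); // token 19624 decoded to "de" in the example above
    // 99999 is not in the vocabulary, so it decodes to '?'.
    println!("{}", decode(&[19624, 99999], &vocab)); // prints "de?"
}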