AI: bump vocab_size from 256 to 34816 using byte pair encoding
BPE
My byte-pair encoding (BPE) implementation works.
PR: https://github.com/sebhtml/novigrad/pull/17
I increased vocab_size from 256 to 34816 with BPE.
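For intuition, BPE starts from the 256 byte values and repeatedly merges the most frequent adjacent pair of tokens into a new token until the target vocabulary size is reached. Here is a minimal sketch of that idea in Rust (this is not the novigrad code; real implementations differ in how they count pairs and break ties):

use std::collections::HashMap;

// Minimal BPE training sketch: start from byte-level tokens (0..256) and
// repeatedly merge the most frequent adjacent pair into a new token until
// the vocabulary reaches the target size.
fn train_bpe(text: &str, target_vocab_size: u32) -> (Vec<u32>, HashMap<(u32, u32), u32>) {
    let mut tokens: Vec<u32> = text.bytes().map(|b| b as u32).collect();
    let mut merges: HashMap<(u32, u32), u32> = HashMap::new();
    let mut next_token: u32 = 256;

    while next_token < target_vocab_size {
        // Count adjacent token pairs.
        let mut counts: HashMap<(u32, u32), usize> = HashMap::new();
        for pair in tokens.windows(2) {
            *counts.entry((pair[0], pair[1])).or_insert(0) += 1;
        }
        // Pick the most frequent pair; stop when nothing repeats anymore.
        let Some((&best_pair, &count)) = counts.iter().max_by_key(|&(_, &c)| c) else {
            break;
        };
        if count < 2 {
            break;
        }
        merges.insert(best_pair, next_token);

        // Replace every occurrence of the pair with the new token.
        let mut merged = Vec::with_capacity(tokens.len());
        let mut i = 0;
        while i < tokens.len() {
            if i + 1 < tokens.len() && (tokens[i], tokens[i + 1]) == best_pair {
                merged.push(next_token);
                i += 2;
            } else {
                merged.push(tokens[i]);
                i += 1;
            }
        }
        tokens = merged;
        next_token += 1;
    }
    (tokens, merges)
}

fn main() {
    let (tokens, merges) = train_bpe("low lower lowest low low", 300);
    println!("{} tokens, {} merges", tokens.len(), merges.len());
}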
GPU
Thanks to NVIDIA, I can do this on my laptop, which has an NVIDIA RTX 4060.
Tokens
The text I am using has 78970 characters.
With my BPE implementation, this text is encoded into 44513 tokens, which is roughly 1.77 characters per token.
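Encoding then replays the learned merges on the raw bytes. Here is a sketch of that step under the same assumptions as above (the merges map, with lower IDs learned first, is an illustration, not the novigrad API):

use std::collections::HashMap;

// Encode text by replaying the learned merges, earliest-learned first
// (lower new-token IDs were created first). Sketch only.
fn encode(text: &str, merges: &HashMap<(u32, u32), u32>) -> Vec<u32> {
    let mut tokens: Vec<u32> = text.bytes().map(|b| b as u32).collect();
    loop {
        // Among all pairs that have a learned merge, apply the one learned first.
        let best = tokens
            .windows(2)
            .filter_map(|p| merges.get(&(p[0], p[1])).map(|&id| ((p[0], p[1]), id)))
            .min_by_key(|&(_, id)| id);
        let Some((pair, new_id)) = best else {
            break;
        };
        let mut merged = Vec::with_capacity(tokens.len());
        let mut i = 0;
        while i < tokens.len() {
            if i + 1 < tokens.len() && (tokens[i], tokens[i + 1]) == pair {
                merged.push(new_id);
                i += 2;
            } else {
                merged.push(tokens[i]);
                i += 1;
            }
        }
        tokens = merged;
    }
    tokens
}

fn main() {
    // Hypothetical learned merges, just to exercise the function.
    let mut merges = HashMap::new();
    merges.insert((b'l' as u32, b'o' as u32), 256); // "l" + "o" -> "lo"
    merges.insert((256, b'w' as u32), 257);         // "lo" + "w" -> "low"
    println!("{:?}", encode("low lower", &merges)); // [257, 32, 257, 101, 114]
}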
Loss
The loss goes from 108.2 to 0 during training:
Epoch 0 Total_error 108.20975, change: NaN
Epoch 100 Total_error 0, change: -1
Epoch 300 Total_error 0, change: NaN
Example 0
input_text: {{short description|Video game franchise}}
{{About|the vi
expected_output_text: de
input_tokens: [22796, 30981, 11071, 20834, 19622, 31405, 30073, 11, 23050, 12550, 13, 32014, 19623, 16841, 6070, 17, 125, 28674, 1460, 15833, 20052, 8, 22923, 20, 22797, 21, 24292, 31531, 13, 5289, 126, 24990]
expected_output_token: [19624]
Example 0 before training:
Loss 11.230275
actual_output_token: [13142]
actual_output_text: me
Example 0 after training:
Loss -0
actual_output_token: [19624]
actual_output_text: de
Weirdness
The weird thing is that the neural network can generate a token that is not known to the BPE decoder. When that happens, I chose to simply emit '?'.
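Here is a sketch of that fallback, assuming a vocab map from token ID to text (the names decode and vocab are illustrative, not the actual novigrad decoder):

use std::collections::HashMap;

// Decode token IDs back to text, emitting '?' for any ID that is not in
// the BPE vocabulary (for example an ID the network made up).
fn decode(tokens: &[u32], vocab: &HashMap<u32, String>) -> String {
    tokens
        .iter()
        .map(|id| vocab.get(id).map(String::as_str).unwrap_or("?"))
        .collect()
}

fn main() {
    let mut vocab = HashMap::new();
    vocab.insert(19624_u32, "de".to_string()); // token 19624 decoded to "de" in the example above
    // 99999 is not in the vocabulary, so it decodes to '?'.
    println!("{}", decode(&[19624, 99999], &vocab)); // prints "de?"
}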