dm.cs.tu-dortmund.de/mlbits/text-mining-subword-tokenization/
Subword Tokenization – Lecture Notes
. Replace Z = or .
Most common sequences: XZ. Replace W = XZ = wor.
No sequences with more than 1 occurrence. Stop.
Dictionary: W = wor, X = w, Y = th, Z = or
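The merge loop walked through above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's reference implementation: the function name `bpe_merges`, the `max_merges` cap, and the tie-breaking rule (the pair that is counted first wins) are assumptions made for this example. Merged symbols are represented by their expanded strings rather than by placeholder letters such as W, X, Y, Z.

```python
from collections import Counter


def bpe_merges(text, max_merges=100):
    """Greedy byte-pair merging (hypothetical helper for illustration).

    Repeatedly replaces the most frequent adjacent pair of symbols by a
    single merged symbol and stops when no pair occurs more than once.
    """
    symbols = list(text)   # start from single characters
    dictionary = {}        # merged symbol -> the pair it replaced
    for _ in range(max_merges):
        pairs = Counter(zip(symbols, symbols[1:]))
        if not pairs:
            break
        # Assumption: on ties, the pair counted first is chosen.
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break          # no sequence with more than 1 occurrence: stop
        merged = a + b
        dictionary[merged] = (a, b)
        # Replace non-overlapping occurrences of (a, b), left to right.
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(merged)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols, dictionary


symbols, dictionary = bpe_merges("abab")
print(symbols)     # ['ab', 'ab']
print(dictionary)  # {'ab': ('a', 'b')}
```

Because ties are resolved arbitrarily here, a different tie-breaking rule can yield a different dictionary for the same input.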
Note: there can be ties, and neither [...] algorithm to estimate the token probabilities \(P(x)\)
keep the words with the largest loss when removed
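The pruning criterion above can be illustrated with a small sketch. It assumes that token probabilities \(P(x)\) are already given (however they were estimated), that each word is scored by its single best segmentation (found by dynamic programming), and that probabilities are not renormalized after a token is removed. The helper names `best_log_prob` and `removal_losses` and the toy vocabulary are hypothetical.

```python
import math


def best_log_prob(word, log_p):
    """Log-probability of the best segmentation of `word` into known tokens."""
    n = len(word)
    best = [-math.inf] * (n + 1)
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in log_p and best[start] + log_p[piece] > best[end]:
                best[end] = best[start] + log_p[piece]
    return best[n]


def removal_losses(words, probs):
    """Drop in corpus log-likelihood when each multi-character token is removed."""
    log_p = {t: math.log(p) for t, p in probs.items()}
    base = sum(best_log_prob(w, log_p) for w in words)
    losses = {}
    for token in log_p:
        if len(token) == 1:
            continue  # keep single characters so every word stays segmentable
        reduced = {t: lp for t, lp in log_p.items() if t != token}
        losses[token] = base - sum(best_log_prob(w, reduced) for w in words)
    return losses


# Toy data, invented for this example.
words = ["ab", "ab", "abc"]
probs = {"a": 0.2, "b": 0.2, "c": 0.2, "ab": 0.3, "bc": 0.1}
losses = removal_losses(words, probs)
# Tokens with the largest loss are the ones worth keeping.
for token, loss in sorted(losses.items(), key=lambda kv: -kv[1]):
    print(token, round(loss, 3))
```

In this toy setup, removing `ab` lowers the likelihood of every word, while removing `bc` changes nothing because no best segmentation uses it, so `ab` would be kept first.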
References
[DCLT19]
Devlin, J., Chang, M.-W., Lee, K. and Toutanova, K. 2019. BERT: Pre-training of …