2018

SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing

Taku Kudo, John Richardson

citations

Cite Score

73

AI summary

This paper introduces SentencePiece, a language-independent subword tokenizer and detokenizer for neural text processing. It trains subword models directly from raw sentences, without requiring pre-tokenized input, and in English-Japanese neural machine translation (NMT) experiments this direct training achieves accuracy comparable to conventional training on pre-tokenized text.

Main Contributions

  • Introduces SentencePiece, an open-source subword tokenizer and detokenizer designed for neural text processing.
  • Enables training subword models directly from raw sentences, allowing for purely end-to-end and language-independent systems.
  • Shows, on English-Japanese neural machine translation (NMT), that subword training directly from raw sentences achieves accuracy comparable to training on pre-tokenized input.
  • The model file of SentencePiece is self-contained to guarantee perfect reproducibility of the normalization and subword segmentation.
  • Provides a stable and reproducible text processing tool for production use and helps the research community to move to more language-agnostic and multilingual architectures.
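
The tokenizer and detokenizer pairing listed above can be illustrated with a toy sketch. It assumes the whitespace-escaping convention in which each space is replaced by the meta symbol ▁ (U+2581), a detail that is not spelled out on this page. The vocabulary and the greedy longest-match segmentation are hypothetical stand-ins chosen for illustration, not SentencePiece's actual segmentation algorithm.

```python
WS = "\u2581"  # meta symbol standing in for a space (assumed convention)


def tokenize(text, vocab):
    """Greedy longest-match segmentation over the escaped raw text."""
    escaped = text.replace(" ", WS)
    pieces, i = [], 0
    while i < len(escaped):
        # Try the longest candidate first; fall back to a single character.
        for j in range(len(escaped), i, -1):
            if escaped[i:j] in vocab or j == i + 1:
                pieces.append(escaped[i:j])
                i = j
                break
    return pieces


def detokenize(pieces):
    """Invert tokenize: concatenate the pieces and restore the spaces."""
    return "".join(pieces).replace(WS, " ")


# Hypothetical toy vocabulary, chosen only for this illustration.
TOY_VOCAB = {"Hello", WS + "wor", "ld", "."}
```

Because whitespace is kept as an ordinary symbol inside the pieces, `detokenize(tokenize(text, TOY_VOCAB))` returns the original text without any language-specific rules, provided the input does not already contain the meta symbol.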

Abstract

This paper describes SentencePiece, a language-independent subword tokenizer and detokenizer designed for Neural-based text processing, including Neural Machine Translation. It provides open-source C++ and Python implementations for subword units. While existing subword segmentation tools assume that the input is pre-tokenized into word sequences, SentencePiece can train subword models directly from raw sentences, which allows us to make a purely end-to-end and language independent system. We perform a validation experiment of NMT on English-Japanese machine translation, and find that it is possible to achieve comparable accuracy to direct subword training from raw sentences. We also compare the performance of subword training and segmentation with various configurations. SentencePiece is available under the Apache 2 license at https://github.com/google/sentencepiece.
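
What "training directly from raw sentences" can mean is sketched below with a toy byte-pair-encoding (BPE) trainer. This is an illustration under stated assumptions, not the SentencePiece implementation: the function names `train_bpe`, `apply_merge`, and `encode` are hypothetical, the ▁ (U+2581) whitespace symbol is an assumed convention, and unlike a production tool this toy places no restriction on merges that span a word boundary.

```python
from collections import Counter

WS = "\u2581"  # assumed meta symbol for whitespace


def apply_merge(seq, pair):
    """Replace every adjacent occurrence of `pair` in `seq` with one merged symbol."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def train_bpe(sentences, num_merges):
    """Learn BPE merges from raw sentences, with no word-level pre-tokenization."""
    # Each sentence is one symbol sequence; escaped spaces are ordinary symbols.
    seqs = [list(s.replace(" ", WS)) for s in sentences]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq in seqs:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        # Most frequent pair; ties are broken by pair order for determinism.
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merges.append(best)
        seqs = [apply_merge(seq, best) for seq in seqs]
    return merges


def encode(text, merges):
    """Segment raw text by replaying the learned merges in training order."""
    seq = list(text.replace(" ", WS))
    for pair in merges:
        seq = apply_merge(seq, pair)
    return seq
```

The trainer never splits its input into words first, so the same code applies unchanged to languages such as Japanese that do not separate words with spaces.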
