2018
Cite Score
73
AI summary
This paper introduces SentencePiece, a language-independent subword tokenizer and detokenizer for neural text processing. Unlike existing subword segmentation tools, which expect pre-tokenized input, it trains subword models directly from raw sentences. A validation experiment on English-Japanese neural machine translation (NMT) finds that training directly from raw sentences can achieve comparable accuracy.
Abstract
This paper describes SentencePiece, a language-independent subword tokenizer and detokenizer designed for neural-based text processing, including Neural Machine Translation (NMT). It provides open-source C++ and Python implementations for subword units. While existing subword segmentation tools assume that the input is pre-tokenized into word sequences, SentencePiece can train subword models directly from raw sentences, which allows us to make a purely end-to-end and language-independent system. We perform a validation experiment of NMT on English-Japanese machine translation, and find that it is possible to achieve comparable accuracy to direct subword training from raw sentences. We also compare the performance of subword training and segmentation with various configurations. SentencePiece is available under the Apache 2 license at https://github.com/google/sentencepiece.
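To make "training subword models directly from raw sentences" concrete, here is a toy sketch in plain Python. It is not the SentencePiece implementation or its API. It assumes a simple BPE-style merge procedure and borrows the convention, described in the SentencePiece project documentation rather than in the abstract above, of escaping spaces with the meta symbol "▁" (U+2581) so that detokenization is a plain string join. All function names (`to_symbols`, `train_merges`, `encode`, `decode`) are hypothetical.

```python
from collections import Counter

WS = "\u2581"  # "▁", meta symbol standing in for a space (assumption, see lead-in)


def to_symbols(sentence):
    """Escape spaces and split a raw sentence into single characters."""
    return list(sentence.replace(" ", WS))


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def train_merges(sentences, num_merges):
    """Learn up to `num_merges` BPE-style merges from raw sentences.

    No word-level pre-tokenization happens: each sentence is one symbol
    sequence in which the escaped space is an ordinary symbol.
    """
    corpus = [to_symbols(s) for s in sentences]
    merges = []
    for _ in range(num_merges):
        counts = Counter()
        for symbols in corpus:
            counts.update(zip(symbols, symbols[1:]))
        if not counts:
            break
        best = max(counts, key=counts.get)  # most frequent adjacent pair
        merges.append(best)
        corpus = [merge_pair(symbols, best) for symbols in corpus]
    return merges


def encode(sentence, merges):
    """Segment a raw sentence by replaying the learned merges in order."""
    symbols = to_symbols(sentence)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


def decode(pieces):
    """Detokenize: concatenate the pieces and restore the spaces."""
    return "".join(pieces).replace(WS, " ")
```

Because whitespace is kept as a symbol, `decode(encode(s, merges))` returns `s` for any sentence that does not already contain "▁", whether the language separates words with spaces (English) or not (Japanese). Unlike this toy, the real tool has more than one segmentation algorithm and further options, so treat the sketch only as an illustration of the raw-sentence idea.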