1-800-BAD-CODE/sentence_boundary_detection_multilang

sentence boundary detection

token classification

1-800-BAD-CODE committed on Jan 1, 2023

Commit 3e20b04 · 1 Parent(s): e62dcf1

Update README.md

Files changed (1)

README.md +46 -0

@@ -1,3 +1,49 @@
 ---
 license: apache-2.0
+language:
+  - ar
+  - bn
+  - de
+  - en
+  - es
+  - et
+  - fi
+  - fr
+  - hi
+  - id
+  - is
+  - it
+  - ja
+  - lt
+  - lv
+  - ko
+  - nl
+  - no
+  - pl
+  - pt
+  - ru
+  - tr
+  - sv
+  - uk
+  - zh
 ---
+# Model Overview
+This model performs sentence boundary detection (SBD) for 25 languages.
+It accepts arbitrarily long, punctuated text as input and outputs the constituent sentences of that text.
+# Model Architecture
+This is a data-driven approach to SBD.
+Input texts are tokenized with a SentencePiece model, encoded with a BERT-style encoder, and then projected to sentence boundary probabilities by a linear layer.
+For each input token `t`, this model predicts the probability that `t` is the final token of a sentence (i.e., a sentence boundary).
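The per-token prediction described above can be turned into sentences with a simple decoding step. The following is a minimal sketch, not the model's actual post-processing: the function name `split_on_boundaries`, the 0.5 threshold, and the example tokens and probabilities are all assumptions made for illustration.

```python
from typing import List, Sequence


def split_on_boundaries(
    tokens: Sequence[str],
    boundary_probs: Sequence[float],
    threshold: float = 0.5,  # assumed cutoff; the README does not specify one
) -> List[List[str]]:
    """Group tokens into sentences, closing a sentence after each boundary token."""
    if len(tokens) != len(boundary_probs):
        raise ValueError("tokens and boundary_probs must have the same length")

    sentences: List[List[str]] = []
    current: List[str] = []
    for token, prob in zip(tokens, boundary_probs):
        current.append(token)
        # A token whose probability meets the threshold ends the current sentence.
        if prob >= threshold:
            sentences.append(current)
            current = []
    # Keep any trailing tokens that were never closed by a boundary.
    if current:
        sentences.append(current)
    return sentences


# Hypothetical tokens and probabilities, standing in for real model output.
tokens = ["Hello", "world", ".", "How", "are", "you", "?"]
probs = [0.01, 0.02, 0.98, 0.01, 0.03, 0.02, 0.97]
sentences = split_on_boundaries(tokens, probs)
```

With real model output, the tokens would be SentencePiece pieces, so each group would still need to be detokenized back into text.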
+# Example Usage
+This model has been exported to ONNX alongside a SentencePiece tokenizer.
+```bash
+```