Update README.md
README.md CHANGED
```diff
@@ -22,6 +22,7 @@ __Chonky__ is a transformer model that intelligently segments text into meaningful
 
 The model processes text and divides it into semantically coherent segments. These chunks can then be fed into embedding-based retrieval systems or language models as part of a RAG pipeline.
 
+⚠️ This model was fine-tuned on sequences of length 1024 (by default, ModernBERT supports sequence lengths up to 8192).
 
 ## How to use
 
@@ -63,7 +64,7 @@ from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
 
 model_name = "mirth/chonky_modernbert_base_1"
 
-tokenizer = AutoTokenizer.from_pretrained(model_name)
+tokenizer = AutoTokenizer.from_pretrained(model_name, model_max_length=1024)
 
 id2label = {
     0: "O",
@@ -81,6 +82,7 @@ model = AutoModelForTokenClassification.from_pretrained(
     label2id=label2id,
 )
 
+
 pipe = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
 
 text = """Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep. The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights."""
```
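The new warning matters for long inputs: with `model_max_length=1024`, anything past 1024 tokens is silently truncated rather than chunked. Below is a minimal sketch of one way around that, assuming the `tokenizer` and `pipe` defined above. `chunk_long_text` is a hypothetical helper, not part of the README, and it makes the simplifying assumption that no semantic chunk straddles a window boundary.

```python
def chunk_long_text(text: str, max_tokens: int = 1024) -> list[str]:
    """Split `text` into tokenizer-sized windows, then chunk each window."""
    enc = tokenizer(text, add_special_tokens=False, return_offsets_mapping=True)
    offsets = enc["offset_mapping"]  # (char_start, char_end) per token
    step = max_tokens - 2  # leave room for the special tokens the pipeline adds
    chunks = []
    for i in range(0, len(offsets), step):
        window_offsets = offsets[i : i + step]
        window = text[window_offsets[0][0] : window_offsets[-1][1]]
        prev = 0
        for span in pipe(window):  # span offsets are relative to `window`
            chunks.append(window[prev : span["end"]].strip())
            prev = span["end"]
        if prev < len(window):
            # Simplification: the tail of a window becomes its own chunk,
            # even if the semantic chunk continues into the next window.
            chunks.append(window[prev:].strip())
    return chunks

chunks = chunk_long_text(long_document)  # long_document: any string over 1024 tokens
```

Recent transformers releases also accept a `stride` argument on the token-classification pipeline, which may handle the windowing for you; check your installed version before relying on it.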