Update README.md
README.md CHANGED
```diff
@@ -22,6 +22,7 @@ __Chonky__ is a transformer model that intelligently segments text into meaningful
 
 The model processes text and divides it into semantically coherent segments. These chunks can then be fed into embedding-based retrieval systems or language models as part of a RAG pipeline.
 
+⚠️ This model was fine-tuned on sequences of length 1024 (by default, ModernBERT supports sequence lengths up to 8192).
 
 ## How to use
 
@@ -63,7 +64,7 @@ from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
 
 model_name = "mirth/chonky_modernbert_base_1"
 
-tokenizer = AutoTokenizer.from_pretrained(model_name)
+tokenizer = AutoTokenizer.from_pretrained(model_name, model_max_length=1024)
 
 id2label = {
     0: "O",
@@ -81,6 +82,7 @@ model = AutoModelForTokenClassification.from_pretrained(
     label2id=label2id,
 )
 
+
 pipe = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
 
 text = """Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep. The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights."""
```
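The new warning matters for long inputs: with `model_max_length=1024`, anything past 1024 tokens is silently truncated rather than chunked. Below is a minimal sketch of one way around that, assuming the `tokenizer` and `pipe` defined above. `chunk_long_text` is a hypothetical helper, not part of the README, and it makes the simplifying assumption that no semantic chunk straddles a window boundary.

```python
def chunk_long_text(text: str, max_tokens: int = 1024) -> list[str]:
    """Split `text` into tokenizer-sized windows, then chunk each window."""
    enc = tokenizer(text, add_special_tokens=False, return_offsets_mapping=True)
    offsets = enc["offset_mapping"]  # (char_start, char_end) per token
    step = max_tokens - 2  # leave room for the special tokens the pipeline adds
    chunks = []
    for i in range(0, len(offsets), step):
        window_offsets = offsets[i : i + step]
        window = text[window_offsets[0][0] : window_offsets[-1][1]]
        prev = 0
        for span in pipe(window):  # span offsets are relative to `window`
            chunks.append(window[prev : span["end"]].strip())
            prev = span["end"]
        if prev < len(window):
            # Simplification: the tail of a window becomes its own chunk,
            # even if the semantic chunk continues into the next window.
            chunks.append(window[prev:].strip())
    return chunks

chunks = chunk_long_text(long_document)  # long_document: any string over 1024 tokens
```

Recent transformers releases also accept a `stride` argument on the token-classification pipeline, which may handle the windowing for you; check your installed version before relying on it.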