mirth committed (verified)
Commit d999c83 · Parent(s): b3484e7

Update README.md
Files changed (1): README.md (+3, −1)
README.md CHANGED
@@ -22,6 +22,7 @@ __Chonky__ is a transformer model that intelligently segments text into meaningf
 
 The model processes text and divides it into semantically coherent segments. These chunks can then be fed into embedding-based retrieval systems or language models as part of a RAG pipeline.
 
+⚠️ This model was fine-tuned on sequences of length 1024 (by default, ModernBERT supports sequence lengths up to 8192).
 
 ## How to use
 
@@ -63,7 +64,7 @@ from transformers import AutoTokenizer, AutoModelForTokenClassification, pipelin
 
 model_name = "mirth/chonky_modernbert_base_1"
 
-tokenizer = AutoTokenizer.from_pretrained(model_name)
+tokenizer = AutoTokenizer.from_pretrained(model_name, model_max_length=1024)
 
 id2label = {
     0: "O",
@@ -81,6 +82,7 @@ model = AutoModelForTokenClassification.from_pretrained(
     label2id=label2id,
 )
 
+
 pipe = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
 
 text = """Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep. The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights."""
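A `ner` pipeline with `aggregation_strategy="simple"`, as in the README snippet above, returns a list of aggregated spans with character offsets (`start`, `end`); turning those spans into text chunks is left to the caller. A minimal sketch of that post-processing step, with a hand-written span standing in for real model output (the offsets and the `entity_group` name below are illustrative assumptions, not produced by the actual model):

```python
# Sketch: convert pipeline-style spans into text chunks by splitting
# the input at the end offset of each predicted span.

def spans_to_chunks(text, spans):
    """Split `text` after each span's `end` character offset."""
    chunks, start = [], 0
    for span in sorted(spans, key=lambda s: s["end"]):
        chunks.append(text[start:span["end"]].strip())
        start = span["end"]
    tail = text[start:].strip()
    if tail:
        chunks.append(tail)
    return chunks

sample = "First topic sentence. Still first topic. Second topic starts here."
# Hypothetical aggregated output: one span whose `end` falls after the
# first topic (character offset 40). Real offsets come from the model.
sample_spans = [{"entity_group": "separator", "start": 22, "end": 40}]

print(spans_to_chunks(sample, sample_spans))
# → ['First topic sentence. Still first topic.', 'Second topic starts here.']
```

Splitting on the `end` offsets keeps every character of the input in exactly one chunk, which matters when the chunks are later embedded for retrieval.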