|
--- |
|
license: apache-2.0 |
|
base_model: |
|
- PleIAs/OCRonos-Vintage |
|
library_name: transformers |
|
language: |
|
- es |
|
pipeline_tag: text-generation |
|
tags: |
|
- OCR |
|
- text-correction |
|
- ocr-correction |
|
- archives |
|
- GPT2 |
|
- history |
|
- SLM |
|
- pre-train |
|
- drama |
|
--- |
|
|
|
**Filiberto 124M Instruct** is a small specialized model for OCR correction of Spanish Golden Age dramas, based on [OCRonos-Vintage](https://hf.co/PleIAs/OCRonos-Vintage), a model for OCR correction of cultural heritage archives.
|
|
|
Filiberto 124M Instruct has only 124 million parameters. It can run easily on CPU or provide correction at scale on GPUs (>10,000 tokens/second).
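At scale, correction is typically applied verse by verse in batches. As an illustration (the helper below is a sketch, not part of the released code), verses can be grouped into fixed-size batches before being sent to the GPU:

```python
# Illustrative helper (not from the model card): split a list of
# OCRized verses into fixed-size batches for GPU batch inference.
def batch_verses(verses, batch_size=32):
    return [verses[i:i + batch_size] for i in range(0, len(verses), batch_size)]

batches = batch_verses(["verso 1", "verso 2", "verso 3"], batch_size=2)
print(batches)  # [['verso 1', 'verso 2'], ['verso 3']]
```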
|
|
|
## Training |
|
The pre-training data included a collection of individual verses and their corrections taken from the [TEXORO](https://etso.es/texoro) corpus, via a collaboration with [ETSO](https://etso.es/), totalling ~5 million tokens.
|
|
|
Pre-training ran for 5 epochs with [levanter](https://github.com/stanford-crfm/levanter) (500 steps total, each processing 1024 sequences of 512 tokens) on a TPU v4-32 for 15 minutes.
|
|
|
Tokenization is currently done with the GPT-2 tokenizer. |
|
|
|
## Example of OCR correction |
|
Filiberto 124M Instruct has been pre-trained on an instruction dataset with a hard-coded structure: `### Text ###` introduces the OCRized text to correct, and `### Correction ###` introduces the generated correction.
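This template can be reproduced with a small helper (`build_prompt` is a hypothetical name used here for illustration, not part of the released code):

```python
# Build the hard-coded instruction structure described above:
# the OCRized text after "### Text ###", then the "### Correction ###"
# marker after which the model generates the corrected verse.
def build_prompt(ocr_text: str) -> str:
    return f"### Text ###\n{ocr_text}\n\n\n### Correction ###\n"

print(build_prompt("Otra vez, Don Iuan, me dad,"))
```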
|
|
|
Filiberto 124M Instruct can be imported like any GPT-2-style model:
|
|
|
```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load the pre-trained model and tokenizer
model_name = "bertin-project/filiberto-124M-instruct"
model = GPT2LMHeadModel.from_pretrained(model_name)
tokenizer = GPT2Tokenizer.from_pretrained(model_name)

# Set the device to GPU if available, otherwise use CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
```
|
|
|
Afterwards, inference can be run like this:
|
|
|
```python
# Generate an OCR correction for a given OCRized text
def ocr_correction(prompt, max_new_tokens=600):
    # Wrap the input in the hard-coded instruction structure
    prompt = f"""### Text ###\n{prompt}\n\n\n### Correction ###\n"""
    input_ids = tokenizer.encode(prompt, return_tensors="pt").to(device)

    # Generate the correction
    output = model.generate(
        input_ids,
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id,
        top_k=50,
    )

    # Decode and return only the text after the correction marker
    return tokenizer.decode(output[0], skip_special_tokens=True).split("### Correction ###")[-1].strip()

# `prompt` holds the raw OCRized text to correct (see the example below)
ocr_result = ocr_correction(prompt)
print(ocr_result)
```
|
|
|
An example of an OCRized drama: |
|
|
|
```
Otra vez, Don Iuan, me dad,
y otras mil vezes los braços.
Otra, y otras mil sean lazos
de nuestra antigua amistad.
Como venis?
Yo me siento
tan alegre, tan vfano,
tan venturoso, tan vano,
que no podrà el pensamiento
encareceros jamàs
las venturas que posseo,
porque el pensamiento creo
```
|
|
|
would yield this result: |
|
|
|
```
Otra vez, Don Iuan, me dad,
y otras mil vezes los braços.
Otra, y otras mil sean lazos
de nuestra antigua amistad.
Como venis?
Yo me siento
tan alegre, tan vfano,
tan venturoso, tan vano,
que no podrà el pensamiento
encareceros jamàs
las venturas que posseo,
porque el pensamiento creo
```