|
--- |
|
license: apache-2.0 |
|
base_model: |
|
- PleIAs/OCRonos-Vintage |
|
library_name: transformers |
|
language: |
|
- es |
|
pipeline_tag: text-generation |
|
tags: |
|
- OCR |
|
- text-correction |
|
- ocr-correction |
|
- archives |
|
- GPT2 |
|
- history |
|
- SLM |
|
- pre-train |
|
- drama |
|
--- |
|
|
|
**Filiberto 124M Instruct** is a small specialized model for OCR correction of Spanish Golden Age dramas, based on [OCRonos-Vintage](https://hf.co/PleIAs/OCRonos-Vintage), a model for OCR correction of cultural heritage archives.
|
|
|
Filiberto 124M Instruct has only 124 million parameters. It can run easily on CPU or provide correction at scale on GPUs (>10,000 tokens/second).
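At scale, correction is typically applied verse by verse in batches. As an illustration (the helper below is a sketch, not part of the released code), verses can be grouped into fixed-size batches before being sent to the GPU:

```python
# Illustrative helper (not from the model card): split a list of
# OCRized verses into fixed-size batches for GPU batch inference.
def batch_verses(verses, batch_size=32):
    return [verses[i:i + batch_size] for i in range(0, len(verses), batch_size)]

batches = batch_verses(["verso 1", "verso 2", "verso 3"], batch_size=2)
print(batches)  # [['verso 1', 'verso 2'], ['verso 3']]
```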
|
|
|
## Training |
|
The pre-training data included a collection of individual verses and their corrections taken from the [TEXORO](https://etso.es/texoro) corpus, via a collaboration with [ETSO](https://etso.es/), totalling ~5 million tokens.
|
|
|
Pre-training ran for 5 epochs with [levanter](https://github.com/stanford-crfm/levanter) (500 steps total, each processing 1024 sequences of 512 tokens) on a TPU v4-32 for 15 minutes.
|
|
|
Tokenization is currently done with the GPT-2 tokenizer. |
|
|
|
## Example of OCR correction |
|
Filiberto 124M Instruct has been pre-trained on an instruction dataset with a hard-coded structure: `### Text ###` introduces the OCRized text to correct, and `### Correction ###` introduces the generated correction.
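This template can be reproduced with a small helper (`build_prompt` is a hypothetical name used here for illustration, not part of the released code):

```python
# Build the hard-coded instruction structure described above:
# the OCRized text after "### Text ###", then the "### Correction ###"
# marker after which the model generates the corrected verse.
def build_prompt(ocr_text: str) -> str:
    return f"### Text ###\n{ocr_text}\n\n\n### Correction ###\n"

print(build_prompt("Otra vez, Don Iuan, me dad,"))
```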
|
|
|
Filiberto 124M Instruct can be imported like any GPT-2-style model:
|
|
|
```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load the pre-trained model and tokenizer
model_name = "bertin-project/filiberto-124M-instruct"
model = GPT2LMHeadModel.from_pretrained(model_name)
tokenizer = GPT2Tokenizer.from_pretrained(model_name)

# Set the device to GPU if available, otherwise use CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
```
|
|
|
Afterwards, inference can be run like this:
|
|
|
```python
# Generate an OCR correction for a given OCRized text
def ocr_correction(prompt, max_new_tokens=600):
    # Wrap the input in the hard-coded instruction structure
    prompt = f"""### Text ###\n{prompt}\n\n\n### Correction ###\n"""
    input_ids = tokenizer.encode(prompt, return_tensors="pt").to(device)

    # Generate the correction
    output = model.generate(
        input_ids,
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id,
        top_k=50,
    )

    # Decode and return only the text after the correction marker
    return tokenizer.decode(output[0], skip_special_tokens=True).split("### Correction ###")[-1].strip()

# `prompt` holds the raw OCRized text to correct (see the example below)
ocr_result = ocr_correction(prompt)
print(ocr_result)
```
|
|
|
An example of an OCRized drama: |
|
|
|
```
Otra vez, Don Iuan, me dad,
y otras mil vezes los braços.
Otra, y otras mil sean lazos
de nuestra antigua amistad.
Como venis?
Yo me siento
tan alegre, tan vfano,
tan venturoso, tan vano,
que no podrà el pensamiento
encareceros jamàs
las venturas que posseo,
porque el pensamiento creo
```
|
|
|
would yield this result: |
|
|
|
```
Otra vez, Don Iuan, me dad,
y otras mil vezes los braços.
Otra, y otras mil sean lazos
de nuestra antigua amistad.
Como venis?
Yo me siento
tan alegre, tan vfano,
tan venturoso, tan vano,
que no podrà el pensamiento
encareceros jamàs
las venturas que posseo,
porque el pensamiento creo
```