---
library_name: transformers
license: apache-2.0
datasets:
- ds4sd/DocLayNet
pipeline_tag: image-segmentation
---

# DETR-layout-detection

We present cmarkea/detr-layout-detection, a model that extracts layout elements (Text, Picture, Caption, Footnote, etc.) from document images. It is a fine-tuning of [detr-resnet-50](https://huggingface.co/facebook/detr-resnet-50) on the [DocLayNet](https://huggingface.co/datasets/ds4sd/DocLayNet) dataset. The model jointly predicts masks and bounding boxes for document objects, which makes it well suited for processing document corpora to be ingested into an ODQA (open-domain question answering) system.

The model can extract 11 entity types: Caption, Footnote, Formula, List-item, Page-footer, Page-header, Picture, Section-header, Table, Text, and Title.
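
For downstream filtering it can help to have these class names at hand as a mapping. A minimal sketch (the index order below is an assumption for illustration; the authoritative mapping lives in `model.config.id2label`):

```python
# The 11 layout classes predicted by the model. The index order here is an
# assumption for illustration; in practice, read model.config.id2label.
LAYOUT_CLASSES = [
    "Caption", "Footnote", "Formula", "List-item", "Page-footer",
    "Page-header", "Picture", "Section-header", "Table", "Text", "Title",
]
id2label = dict(enumerate(LAYOUT_CLASSES))
```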
## Performance

In this section, we assess the model's performance separately for semantic segmentation and object detection. In both cases, no post-processing was applied to the predictions.

For semantic segmentation, we use the F1-score to evaluate the classification of each pixel. For object detection, we assess performance with the Generalized Intersection over Union (GIoU) and the accuracy of the predicted bounding-box class. The evaluation is conducted on 500 pages from the PDF evaluation dataset of DocLayNet.
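
As a reference for the metric: GIoU extends IoU with a penalty based on the smallest box enclosing both boxes, so even disjoint predictions receive a graded (negative) score. A minimal sketch for axis-aligned boxes in `(x0, y0, x1, y1)` format (an illustration of the metric, not the evaluation code used for the table below):

```python
def giou(box_a, box_b):
    """Generalized IoU for two boxes given as (x0, y0, x1, y1)."""
    # Intersection area (zero if the boxes are disjoint).
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    iou = inter / union

    # Smallest enclosing (hull) box.
    cx0, cy0 = min(box_a[0], box_b[0]), min(box_a[1], box_b[1])
    cx1, cy1 = max(box_a[2], box_b[2]), max(box_a[3], box_b[3])
    hull = (cx1 - cx0) * (cy1 - cy0)

    # GIoU = IoU minus the fraction of the hull not covered by the union.
    return iou - (hull - union) / hull
```

GIoU equals IoU for perfectly overlapping boxes (1.0 at best) and goes negative as boxes drift apart, which is why it is a more informative regression target than plain IoU.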
| Class          | f1-score (x100) | GIoU (x100) | accuracy (x100) |
|:--------------:|:---------------:|:-----------:|:---------------:|
| Background     | 94.27           | NA          | NA              |
| Caption        | 72.08           | 67.17       | 43.45           |
| Footnote       | 0               | 67.33       | 0               |
| Formula        | 83              | 72.7        | 93.92           |
| List-item      | 73.12           | 70.62       | 90.26           |
| Page-footer    | 85.37           | 67.66       | 97.42           |
| Page-header    | 65.55           | 71.54       | 85.24           |
| Picture        | 77.66           | 80.92       | 91.19           |
| Section-header | 67.84           | 69.25       | 85.85           |
| Table          | 79.75           | 81.76       | 90.13           |
| Text           | 89.74           | 79.51      | 88              |
| Title          | 48.74           | 52.04       | 36.67           |

## Benchmark

Now, let's compare this model's performance with that of other models.

| Model | f1-score (x100) | GIoU (x100) | accuracy (x100) |
|:---------------------------------------------------------------------------------------------:|:---------------:|:-----------:|:---------------:|
| cmarkea/detr-layout-detection | 89.72 | 76.42 | 87.06 |
| [cmarkea/dit-base-layout-detection](https://huggingface.co/cmarkea/dit-base-layout-detection) | 90.77 | 56.29 | 85.26 |

## Direct Use

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor
from transformers.models.detr import DetrForSegmentation

img_proc = AutoImageProcessor.from_pretrained(
    "cmarkea/detr-layout-detection"
)
model = DetrForSegmentation.from_pretrained(
    "cmarkea/detr-layout-detection"
)

# Load a document page image (the path here is an example).
img = Image.open("page.png")

with torch.inference_mode():
    inputs = img_proc(img, return_tensors='pt')
    output = model(**inputs)

threshold = 0.4

# Pixel-level segmentation masks for the detected objects.
segmentation_mask = img_proc.post_process_segmentation(
    output,
    threshold=threshold,
    target_sizes=[img.size[::-1]]
)

# Bounding boxes with class labels and confidence scores.
bbox_pred = img_proc.post_process_object_detection(
    output,
    threshold=threshold,
    target_sizes=[img.size[::-1]]
)
```
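
`post_process_object_detection` returns one dict per image, holding `scores`, `labels`, and `boxes` tensors. A sketch of turning that into readable regions (the values below are stand-ins for real model output, and `id2label` would normally come from `model.config.id2label`):

```python
# Stand-in for one element of bbox_pred; real entries are torch tensors with
# the same structure (boxes are (x0, y0, x1, y1) in pixel coordinates).
detections = {
    "scores": [0.98, 0.87],
    "labels": [9, 8],
    "boxes": [[12.0, 30.0, 480.0, 210.0], [15.0, 240.0, 470.0, 600.0]],
}
id2label = {8: "Table", 9: "Text"}  # normally model.config.id2label

# Pair each score with its label name and box for downstream use.
regions = [
    {"label": id2label[lab], "score": score, "box": box}
    for score, lab, box in zip(detections["scores"],
                               detections["labels"],
                               detections["boxes"])
]
for r in regions:
    print(f'{r["label"]}: {r["score"]:.2f} at {r["box"]}')
```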
### Citation

```
@online{DeDetrLay,
    AUTHOR = {Cyrile Delestre},
    URL = {https://huggingface.co/cmarkea/detr-base-layout-detection},
    YEAR = {2024},
    KEYWORDS = {Image Processing ; Transformers ; Layout},
}
```