# Read Like Humans: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition
The official code of [ABINet](https://arxiv.org/pdf/2103.06495.pdf) (CVPR 2021, Oral).
ABINet uses a vision model and an explicit language model, trained end to end, to recognize text in the wild. The language model (BCN) achieves bidirectional language representation by simulating a cloze test, and additionally applies an iterative correction strategy.

## Runtime Environment
- We provide a pre-built docker image, built from `docker/Dockerfile`
- Running in Docker
```
$ git clone git@github.com:FangShancheng/ABINet.git
$ docker run --gpus all --rm -ti --ipc=host -v $(pwd)/ABINet:/app fangshancheng/fastai:torch1.1 /bin/bash
```
- (Untested) Or install the dependencies directly
```
pip install -r requirements.txt
```
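After either setup, a quick sanity check that PyTorch is importable and sees the GPUs (the printed version should match whatever the docker image or `requirements.txt` pins):
```
$ python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```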
## Datasets
- Training datasets
1. [MJSynth](http://www.robots.ox.ac.uk/~vgg/data/text/) (MJ):
- Use `tools/create_lmdb_dataset.py` to convert images into LMDB dataset
- [LMDB dataset BaiduNetdisk(passwd:n23k)](https://pan.baidu.com/s/1mgnTiyoR8f6Cm655rFI4HQ)
2. [SynthText](http://www.robots.ox.ac.uk/~vgg/data/scenetext/) (ST):
- Use `tools/crop_by_word_bb.py` to crop word images from the original [SynthText](http://www.robots.ox.ac.uk/~vgg/data/scenetext/) dataset, then convert them into an LMDB dataset with `tools/create_lmdb_dataset.py`
- [LMDB dataset BaiduNetdisk(passwd:n23k)](https://pan.baidu.com/s/1mgnTiyoR8f6Cm655rFI4HQ)
3. [WikiText103](https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-v1.zip), which is only used for pre-training language models:
- Use `notebooks/prepare_wikitext103.ipynb` to convert text into CSV format.
- [CSV dataset BaiduNetdisk(passwd:dk01)](https://pan.baidu.com/s/1yabtnPYDKqhBb_Ie9PGFXA)
- Evaluation datasets: LMDB datasets can be downloaded from [BaiduNetdisk(passwd:1dbv)](https://pan.baidu.com/s/1RUg3Akwp7n8kZYJ55rU5LQ) or [GoogleDrive](https://drive.google.com/file/d/1dTI0ipu14Q1uuK4s4z32DqbqF3dJPdkk/view?usp=sharing):
1. ICDAR 2013 (IC13)
2. ICDAR 2015 (IC15)
3. IIIT5K Words (IIIT)
4. Street View Text (SVT)
5. Street View Text-Perspective (SVTP)
6. CUTE80 (CUTE)
- The structure of the `data` directory is:
```
data
├── charset_36.txt
├── evaluation
│   ├── CUTE80
│   ├── IC13_857
│   ├── IC15_1811
│   ├── IIIT5k_3000
│   ├── SVT
│   └── SVTP
├── training
│   ├── MJ
│   │   ├── MJ_test
│   │   ├── MJ_train
│   │   └── MJ_valid
│   └── ST
├── WikiText-103.csv
└── WikiText-103_eval_d1.csv
```
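To sanity-check a converted split, you can query its sample count. This is a minimal sketch assuming the `num-samples` key convention written by `tools/create_lmdb_dataset.py`:
```
$ python -c "import lmdb; env = lmdb.open('data/training/MJ/MJ_train', readonly=True, lock=False); print(env.begin().get(b'num-samples'))"
```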
### Pretrained Models
Get the pretrained models from [BaiduNetdisk(passwd:kwck)](https://pan.baidu.com/s/1b3vyvPwvh_75FkPlp87czQ) or [GoogleDrive](https://drive.google.com/file/d/1mYM_26qHUom_5NU7iutHneB_KHlLjL5y/view?usp=sharing). The performance of the pretrained models is summarized as follows:
|Model|IC13|SVT|IIIT|IC15|SVTP|CUTE|AVG|
|-|-|-|-|-|-|-|-|
|ABINet-SV|97.1|92.7|95.2|84.0|86.7|88.5|91.4|
|ABINet-LV|97.0|93.4|96.4|85.9|89.5|89.2|92.7|
## Training
1. Pre-train vision model
```
CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py --config=configs/pretrain_vision_model.yaml
```
2. Pre-train language model
```
CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py --config=configs/pretrain_language_model.yaml
```
3. Train ABINet
```
CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py --config=configs/train_abinet.yaml
```
Note:
- You can set the `checkpoint` path for the vision and language models separately to start from specific pretrained weights, or set it to `None` to train from scratch; see the sketch below
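A hypothetical excerpt of `configs/train_abinet.yaml` illustrating the note above (the key layout and paths are assumptions; check the shipped configs for the real names):
```yaml
model:
  vision:
    # illustrative path; or None to train the vision model from scratch
    checkpoint: workdir/pretrain-vision-model/best-pretrain-vision-model.pth
  language:
    # illustrative path; or None to train the language model from scratch
    checkpoint: workdir/pretrain-language-model/pretrain-language-model.pth
```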
## Evaluation
```
CUDA_VISIBLE_DEVICES=0 python main.py --config=configs/train_abinet.yaml --phase test --image_only
```
Additional flags:
- `--checkpoint /path/to/checkpoint` sets the path of the model to evaluate
- `--test_root /path/to/dataset` sets the path of the evaluation dataset
- `--model_eval [alignment|vision]` selects which sub-model to evaluate
- `--image_only` disables dumping visualizations of attention masks
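For example, to evaluate a specific checkpoint on the bundled evaluation sets (the checkpoint path below is a placeholder):
```
CUDA_VISIBLE_DEVICES=0 python main.py --config=configs/train_abinet.yaml --phase test --image_only \
    --checkpoint /path/to/checkpoint.pth --test_root data/evaluation --model_eval alignment
```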
## Web Demo
Integrated into [Hugging Face Spaces 🤗](https://huggingface.co/spaces) using [Gradio](https://github.com/gradio-app/gradio). Try out the web demo: [ABINet-OCR](https://huggingface.co/spaces/tomofi/ABINet-OCR)
## Run Demo
```
python demo.py --config=configs/train_abinet.yaml --input=figs/test
```
Additional flags:
- `--config /path/to/config` sets the path of the configuration file
- `--input /path/to/image-directory` sets the path of an image directory or a wildcard path, e.g., `--input='figs/test/*.png'`
- `--checkpoint /path/to/checkpoint` sets the path of the trained model
- `--cuda [-1|0|1|2|3...]` sets the CUDA device id; the default -1 stands for CPU
- `--model_eval [alignment|vision]` selects which sub-model to use
- `--image_only` disables dumping visualizations of attention masks
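Putting the flags together, a GPU run over a wildcard of images might look like this (the checkpoint path is a placeholder):
```
python demo.py --config=configs/train_abinet.yaml --input='figs/test/*.png' --cuda 0 \
    --checkpoint /path/to/checkpoint.pth --model_eval alignment
```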
## Visualization
Successful and failure cases on low-quality images:

## Citation
If you find our method useful for your research, please cite:
```bibtex
@inproceedings{fang2021read,
  title={Read Like Humans: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition},
  author={Fang, Shancheng and Xie, Hongtao and Wang, Yuxin and Mao, Zhendong and Zhang, Yongdong},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2021}
}
```
## License
This project is free for academic research purposes only, licensed under the 2-clause BSD License; see the LICENSE file for details.
Feel free to contact [email protected] if you have any questions.