feralvam committed
Commit adf8db0 · 1 Parent(s): ba14c59

Update app.py

Files changed (1): app.py +3 -3
app.py CHANGED
@@ -49,11 +49,11 @@ Our models are based on [BERTIN](https://huggingface.co/bertin-project). We fine
 
 Models showcased in the demo are marked with (*) above. More details about how we trained these models can be found in our [report](https://wandb.ai/readability-es/readability-es/reports/Texts-Readability-Analysis-for-Spanish--VmlldzoxNzU2MDUx).
 
-## Final Remarks
+### Final Remarks
 
-- **Data.** One of the main challenges in the area of Automatic Readability Assessment is the availability of reliable data. For Spanish, in particular, the highest-quality existing dataset is Newsela. However, it has a restrictive license that prohibits publicly-sharing its texts. In addition, since its texts are translations from original English news, they can suffer from [translationese](https://en.wiktionary.org/wiki/translationese) deeming them less suitable for training models that will analyse texts produced directly in Spanish. Therefore, our first challenge was to find texts that were originally written in Spanish *and* that contain information about their readability level. Unfortunately, we could not find any other big publicly-available corpus, and decided to combine texts scraped from several webpages. This also prevented us for developing models that could estimate readability in more fine-grained levels (e.g. CEFR levels), which was our original goal. Future work includes contacting editorial groups (similar to Newsela) that create texts for learners of Spanish as a second language, and attempt to establish collaborations that could result in creating new language resources for the readability research community.
+- **Data.** One of the main challenges in the area of Readability Assessment is the availability of reliable data. For Spanish, in particular, the highest-quality existing dataset is Newsela. However, it has a restrictive license that prohibits publicly sharing its texts. In addition, since its texts are translations of original English news, they can suffer from [translationese](https://en.wiktionary.org/wiki/translationese), making them less suitable for training models that will analyse texts produced directly in Spanish. Therefore, our first challenge was to find texts that were originally written in Spanish *and* that contained information about their readability level (i.e. the target gold label). Unfortunately, we could not find any other big publicly-available corpus, so we decided to combine texts scraped from several webpages. This also prevented us from developing models that could estimate readability at more fine-grained levels (e.g. CEFR levels), which was our original goal. Future work will include contacting editorial groups that create texts for learners of Spanish as a second language, and establishing collaborations that could result in new language resources for the readability research community.
 
-- **Models.**
+- **Models.** As explained before, our models are directly fine-tuned versions of [BERTIN](https://huggingface.co/bertin-project). In the future, we aim to compare our models to fine-tuned versions of [multilingual BERT](https://huggingface.co/bert-base-multilingual-cased), to analyse whether multilingual embeddings offer additional benefits. In addition, our current setting treats Readability Assessment as a classification task. Future work includes studying models that treat the problem as a regression task or, as [recent work suggests](https://arxiv.org/abs/2203.07450), as a pair-wise ranking problem.
 
 ### Team
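The pair-wise ranking formulation mentioned in the **Models.** remark can be illustrated with a minimal sketch. This is plain Python with illustrative scores and margin, not the loss or the models from the project: given readability scores for a simpler and a more complex version of a text, a hinge-style margin ranking loss penalises the model when the complex text does not outscore the simple one by at least the margin.

```python
def margin_ranking_loss(score_simple, score_complex, margin=1.0):
    """Hinge-style pair-wise ranking loss (a sketch, not the paper's exact loss).

    Zero when the more complex text outscores the simpler one by at
    least `margin`; grows linearly with the violation otherwise.
    """
    return max(0.0, margin - (score_complex - score_simple))

# Correctly ordered pair: the complex text scores well above the simple one.
print(margin_ranking_loss(0.2, 1.5))  # -> 0.0 (margin satisfied)

# Mis-ordered pair: the simple text scores higher, so the loss is large.
print(margin_ranking_loss(1.5, 0.2))  # -> 2.3
```

Training on such pairs only requires knowing which of two texts is easier, which sidesteps the need for fine-grained absolute labels discussed in the **Data.** remark.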