Update README.md
README.md
@@ -35,6 +35,8 @@ The dataset used consists of about 1.2 M text-image pairs with data from a varie
During training the vision tower was kept completely frozen, along with logit_scale, logit_bias, and the text tower's head. The rest of the text tower was left unfrozen. This helps ensure that the finetuning process preserves the original embedding space and focuses on merely upgrading the context length and the types of text handled.
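As a rough illustration of this freezing scheme, the sketch below selects which parameters stay trainable. The parameter names are hypothetical (loosely following Hugging Face's SigLIP naming), not necessarily the exact ones used here:

```python
def trainable_parameters(param_names):
    """Select the parameters left unfrozen under the scheme above:
    the vision tower, logit_scale, logit_bias, and the text tower's
    head are frozen; the rest of the text tower remains trainable."""
    frozen_prefixes = (
        "vision_model.",     # entire vision tower
        "logit_scale",
        "logit_bias",
        "text_model.head.",  # text tower's head
    )
    return [n for n in param_names if not n.startswith(frozen_prefixes)]

# Hypothetical parameter names, for illustration only:
names = [
    "vision_model.encoder.layer0.weight",
    "logit_scale",
    "logit_bias",
    "text_model.embeddings.position_embedding.weight",
    "text_model.encoder.layer0.weight",
    "text_model.head.weight",
]
print(trainable_parameters(names))
# ['text_model.embeddings.position_embedding.weight',
#  'text_model.encoder.layer0.weight']
```

In a real training loop the same selection would be applied by setting `requires_grad` to False on the frozen parameters.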
+The position embeddings were expanded by leaving the original 64 embeddings intact in their original positions, while initializing the new positions randomly. No ablations were performed to determine whether this is the optimal approach; however, I noted during experimentation that the model is fairly insensitive to the position embeddings.
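The expansion can be sketched as follows. This is a minimal NumPy sketch, assuming the embeddings are a `(num_positions, dim)` array; the target length of 256 and the init scale are illustrative, not the values actually used:

```python
import numpy as np

def expand_position_embeddings(old_emb, new_len, std=0.02, seed=0):
    """Grow position embeddings from old_emb.shape[0] rows to new_len rows:
    the original rows are kept intact in their original positions, and the
    new trailing positions are randomly initialized."""
    rng = np.random.default_rng(seed)
    old_len, dim = old_emb.shape
    new_emb = rng.normal(0.0, std, size=(new_len, dim))
    new_emb[:old_len] = old_emb  # original embeddings preserved verbatim
    return new_emb

old = np.zeros((64, 8))  # stand-in for the original 64 position embeddings
new = expand_position_embeddings(old, 256)
print(new.shape)  # (256, 8)
```

The old rows could equally be copied into any fixed positions; keeping them at the front means short sequences see exactly the embeddings the base model was trained with.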
In practice I've found that this model performs slightly better than the base SigLIP 2 so400m, but tends to prefer shorter text; i.e., given two texts that both perfectly describe the image, the model will tend to weight the shorter of the two higher. The model's ability to recognize booru tag lists for photorealistic images is also imperfect.