Note that Jellyfish is only a 13B model and can be run locally for low cost and data security.
|

| Error Detection | Adult | 94.40 | 92.01 | 92.01 | 96.62 | 90.13 |
| Schema Matching | Synthea | 38.50 | 57.14 | 66.67 | 36.36 | 30.77 |

_Accuracy is used as the metric for data imputation, and the F1 score for the other tasks._
_For GPT-3.5, we used the few-shot approach, while for Jellyfish and Jellyfish-Reasoning, the zero-shot approach was employed._

1. [Ditto](https://arxiv.org/abs/2004.00584) for Entity Matching; [SMAT](https://www.researchgate.net/publication/353920530_SMAT_An_Attention-Based_Deep_Learning_Solution_to_the_Automation_of_Schema_Matching) for Schema Matching; [HoloDetect](https://arxiv.org/abs/1904.02285) for Error Detection; [HoloClean](https://arxiv.org/abs/1702.00820) for Data Imputation
2. [Large Language Models as Data Preprocessors](https://arxiv.org/abs/2308.16361)

We release two versions of Jellyfish: Jellyfish-13B (the main branch) and Jellyfish-13B-Reasoning.
As the names suggest, Jellyfish-13B focuses on providing accurate, direct answers.
In contrast, Jellyfish-13B-Reasoning distills knowledge from GPT-4: it is fine-tuned with data containing reasoning and chain-of-thought responses, generated by GPT-4, for solving data preprocessing tasks.

The two versions are designed for different application scenarios.
Jellyfish-13B is suitable for integration into larger data management systems due to its simple and clear responses, which can be easily transformed into code.
On the other hand, Jellyfish-13B-Reasoning is more user-oriented, with responses that offer deeper insights into the data.
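
To illustrate why direct answers are easy to consume programmatically, here is a minimal Python sketch (not part of this card; the response strings are hypothetical) that maps a Jellyfish-13B reply onto a boolean:

```python
def to_bool(response: str) -> bool:
    # Jellyfish-13B answers matching tasks with a plain "Yes" or "No",
    # so the reply maps directly onto a boolean for downstream code.
    return response.strip().lower().startswith("yes")

# Hypothetical replies, for illustration only.
assert to_bool("Yes") is True
assert to_bool("No") is False
```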

**The Jellyfish paper is coming soon!**

- **Developed by:** Haochen Zhang, Yuyang Dong, Chuan Xiao, Masafumi Oyamada

We used LoRA to speed up the training process, targeting the q_proj and v_proj modules.

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
Here are the prompts we used for both fine-tuning the model and for inference. Feel free to explore different prompts on your own to achieve the best generation quality.
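
For reference, here is a minimal inference sketch using the Hugging Face `transformers` library. The repository ID and generation settings are assumptions for illustration; substitute the actual model repository and one of the task prompts below:

```python
# Minimal inference sketch, assuming the model is hosted on the
# Hugging Face Hub under "NECOUDBFM/Jellyfish" (verify the actual repo ID).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "NECOUDBFM/Jellyfish"  # assumed repository ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Any of the task prompts below goes here, with its placeholders filled in.
prompt = "..."

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)

# Decode only the newly generated tokens, not the echoed prompt.
answer = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(answer)
```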

### Jellyfish-13B

#### For Entity Matching
```
You are tasked with determining whether two records listed below are the same based on the information provided.
Carefully compare the {attribute 1}, {attribute 2}... for each record before making your decision.
Note: Missing values (N/A or \"nan\") should not be used as a basis for your decision.
Record A: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}...]
Record B: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}...]
Are record A and record B the same entity? Choose your answer from: [Yes, No]
```
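
The `{attribute ...}` placeholders are filled in per dataset before the prompt is sent to the model. Below is a small, illustrative Python sketch of that substitution; the field names and record values are invented for the example, not taken from the training data:

```python
# Illustrative filling of the entity-matching template above.
# Note: the \" escapes render as plain quotes in the runtime string.
TEMPLATE = """You are tasked with determining whether two records listed below are the same based on the information provided.
Carefully compare the {attrs} for each record before making your decision.
Note: Missing values (N/A or \"nan\") should not be used as a basis for your decision.
Record A: [{record_a}]
Record B: [{record_b}]
Are record A and record B the same entity? Choose your answer from: [Yes, No]"""

def serialize(record: dict) -> str:
    # Render a record as "attribute: value, attribute: value, ...".
    return ", ".join(f"{k}: {v}" for k, v in record.items())

# Made-up product records, for illustration only.
record_a = {"title": "instant immersion spanish deluxe 2.0",
            "manufacturer": "topics entertainment", "price": "49.99"}
record_b = {"title": "instant immers spanish dlux 2",
            "manufacturer": "nan", "price": "36.11"}

prompt = TEMPLATE.format(attrs="title, manufacturer, price",
                         record_a=serialize(record_a),
                         record_b=serialize(record_b))
print(prompt)
```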

#### For Data Imputation
```
You are presented with a {keyword} record that is missing a specific attribute: {attribute X}.
Your task is to deduce or infer the value of {attribute X} using the available information in the record.
You may be provided with fields like {attribute 1}, {attribute 2}, ... to help you in the inference.
Record: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
Based on the provided record, what would you infer is the value for the missing attribute {attribute X}?
Answer only the value of {attribute X}.
```

#### For Error Detection

_There are two forms of the error detection task.
In the first form, a complete record row is provided, and the task is to determine if a specific value is erroneous.
In the second form, only the value of a specific attribute is given, and the decision about its correctness is based solely on the attribute's name and value.
The subsequent prompt examples pertain to these two forms, respectively._

```
Your task is to determine if there is an error in the value of a specific attribute within the whole record provided.
The attributes may include {attribute 1}, {attribute 2}, ...
Errors may include, but are not limited to, spelling errors, inconsistencies, or values that don't make sense given the context of the whole record.
Record: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}...]
Attribute for Verification: [{attribute X}: {attribute X value}]
Question: Is there an error in the value of {attribute X}? Choose your answer from: [Yes, No]
```
```
Your task is to determine if there is an error in the value of a specific attribute.
The attributes may belong to a healthcare-related record and could be one of the following: {attribute 1}, {attribute 2}, ...
Errors can include, but are not limited to, spelling errors, inconsistencies, or values that don't make sense for that attribute.
Note: Missing values (N/A or \"nan\") are not considered errors.
Attribute for Verification: [{attribute X}: {attribute X value}]
Question: Is there an error in the value of {attribute X}? Choose your answer from: [Yes, No]
```

#### For Schema Matching
```
Your task is to determine if the two attributes (columns) are semantically equivalent in the context of merging two tables.
Each attribute will be provided by its name and a brief description.
Your goal is to assess if they refer to the same information based on the names and descriptions provided.
Attribute A is [name: {the value of name}, description: {the value of description}].
Attribute B is [name: {the value of name}, description: {the value of description}].
Are Attribute A and Attribute B semantically equivalent? Choose your answer from: [Yes, No]
```

### Jellyfish-13B-Reasoning

#### For Entity Matching
```
You are tasked with determining whether two products listed below are the same based on the information provided.
Carefully examine all the attributes before making your decision.
Note: Missing values (N/A or \"nan\") should not be used as a basis for your decision.
Record A: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}...]
Record B: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}...]
Are record A and record B the same entity?
After your reasoning, finish your response in a separate line with and ONLY with your final answer. Choose your final answer from [Yes, No].
```
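
Because the reasoning variant is instructed to finish with the final answer alone on its last line, a caller can recover the label with a simple split. A minimal sketch (the reply text is hypothetical):

```python
def final_answer(response: str) -> str:
    # The prompt asks the model to put its final answer alone on the
    # last line, so take the last non-empty line of the response.
    lines = [line.strip() for line in response.splitlines() if line.strip()]
    return lines[-1]

# Hypothetical chain-of-thought reply, for illustration only.
reply = ("The two records share the same manufacturer and nearly identical "
         "titles, and the price difference is small.\n"
         "Yes")
assert final_answer(reply) == "Yes"
```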

#### For Data Imputation

#### For Error Detection

#### For Schema Matching

<!--
## Bias, Risks, and Limitations