HCZhang committed on
Commit
bdb2316
·
1 Parent(s): ecf2eab

Update README.md

  1. README.md +67 -17
README.md CHANGED
@@ -30,19 +30,26 @@ Note that Jellyfish is only a 13B model and can be run locally for low cost and
30
  | Error Detection | Adult | 94.40| 92.01 | 92.01 | 96.62 | 90.13 |
31
  | Schema Matching | Synthea | 38.50| 57.14 | 66.67 | 36.36 | 30.77 |
32
 
33
- _Accuracy as the metric for data imputation, and the f1 score for other tasks._
 
34
  1.
35
  [Ditto](https://arxiv.org/abs/2004.00584) for Entity Matching
36
  [SMAT](https://www.researchgate.net/publication/353920530_SMAT_An_Attention-Based_Deep_Learning_Solution_to_the_Automation_of_Schema_Matching) for Schema Matching
37
  [HoloDetect](https://arxiv.org/abs/1904.02285) for Error Detection
38
  [HoloClean](https://arxiv.org/abs/1702.00820) for Data Imputation
39
- 2. [Large Language Models as Data Preprocessors](https://arxiv.org/abs/2308.16361)
 
 
40
 
41
  We release two versions of Jellyfish: the Jellyfish-13B (the main branch) and Jellyfish-13B-Reasoning.
42
  As the names suggest, Jellyfish-13B focuses on providing accurate, direct answers.
43
  In contrast, Jellyfish-13B-Reasoning distills knowledge from GPT-4. It is fine-tuned with data containing reasons and chain-of-thought responses for solving data preprocessing tasks
44
  generated by GPT-4.
45
 
 
 
 
 
46
  **Jellyfish paper will be coming soon!**
47
 
48
  - **Developed by:** Haochen Zhang, Yuyang Dong, Chuan Xiao, Masafumi Oyamada
@@ -77,30 +84,73 @@ We used LoRA to speed up the training process, targeting the q_proj and v_proj m
77
  <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
78
  Here are the prompts we used for both fine-tuning the model and for inference. Feel free to explore different prompts on your own to achieve the best generation quality.
79
 
80
- ### For JellyFish-13B
 
81
  ```
82
- You are tasked with determining whether two records listed below are the same based on the information provided. Carefully compare the {attribute 1}, {attribute 2}... for each record before making your decision.
83
-
84
- Note: Missing values (N/A or \"nan\") should not be used as a basis for your decision.
85
-
86
- Record A: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}...]\nProduct B: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}...]
87
-
88
  Are record A and record B the same entity? Choose your answer from: [Yes, No]
89
  ```
90
 
91
  ### For JellyFish-13B-reasoning
 
92
  ```
93
- You are tasked with determining whether two products listed below are the same based on the information provided. Carefully examine all the attributes before making your decision.
94
-
95
- Note: Missing values (N/A or \"nan\") should not be used as a basis for your decision.
96
-
97
- Record A: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}...]\nProduct B: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}...]
 
 
 
 
98
 
99
- Are record A and record B the same entity?
100
 
101
- After your reasoning, finish your response in a separate line with and ONLY with your final answer. Choose your final answer from [Yes, No].",
102
 
103
- ```
104
 
105
  <!--
106
  ## Bias, Risks, and Limitations
 
30
  | Error Detection | Adult | 94.40| 92.01 | 92.01 | 96.62 | 90.13 |
31
  | Schema Matching | Synthea | 38.50| 57.14 | 66.67 | 36.36 | 30.77 |
32
 
33
+ _Accuracy is the metric for data imputation; the F1 score is used for the other tasks._
34
+ _GPT-3.5 was evaluated with the few-shot approach, while Jellyfish and Jellyfish-Reasoning were evaluated with the zero-shot approach._
35
  1.
36
  [Ditto](https://arxiv.org/abs/2004.00584) for Entity Matching
37
  [SMAT](https://www.researchgate.net/publication/353920530_SMAT_An_Attention-Based_Deep_Learning_Solution_to_the_Automation_of_Schema_Matching) for Schema Matching
38
  [HoloDetect](https://arxiv.org/abs/1904.02285) for Error Detection
39
  [HoloClean](https://arxiv.org/abs/1702.00820) for Data Imputation
40
+ 2.
41
+ [Large Language Models as Data Preprocessors](https://arxiv.org/abs/2308.16361)
42
+
43
 
44
  We release two versions of Jellyfish: the Jellyfish-13B (the main branch) and Jellyfish-13B-Reasoning.
45
  As the names suggest, Jellyfish-13B focuses on providing accurate, direct answers.
46
  In contrast, Jellyfish-13B-Reasoning distills knowledge from GPT-4. It is fine-tuned with data containing reasons and chain-of-thought responses for solving data preprocessing tasks
47
  generated by GPT-4.
48
 
49
+ The two versions are designed for different application scenarios.
50
+ Jellyfish-13B is suitable for integration into larger data management systems due to its simple and clear responses that can be easily transformed into code.
51
+ On the other hand, Jellyfish-13B-Reasoning is more user-oriented, with responses that offer deeper insights into the data.
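
To illustrate how a direct Yes/No response might be consumed by a surrounding system, here is a minimal sketch. The helper name `parse_yes_no` is hypothetical, and the tolerance for surrounding whitespace and a trailing period is an assumption about model output, not documented behaviour.

```python
def parse_yes_no(response: str) -> bool:
    """Map a direct 'Yes'/'No' model response to a boolean.

    Assumption: the response contains only the answer, possibly with
    surrounding whitespace or a trailing period.
    """
    answer = response.strip().rstrip(".").lower()
    if answer == "yes":
        return True
    if answer == "no":
        return False
    # Anything else is treated as unparseable rather than guessed at.
    raise ValueError(f"Unexpected response: {response!r}")
```

A caller could use the boolean directly, for example to decide whether two records should be merged.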
52
+
53
  **Jellyfish paper will be coming soon!**
54
 
55
  - **Developed by:** Haochen Zhang, Yuyang Dong, Chuan Xiao, Masafumi Oyamada
 
84
  <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
85
  Here are the prompts we used for both fine-tuning the model and for inference. Feel free to explore different prompts on your own to achieve the best generation quality.
86
 
87
+ ### JellyFish-13B
88
+ #### For Entity Matching
89
  ```
90
+ You are tasked with determining whether two records listed below are the same based on the information provided.
91
+ Carefully compare the {attribute 1}, {attribute 2}... for each record before making your decision.
92
+ Note: Missing values (N/A or \"nan\") should not be used as a basis for your decision.
93
+ Record A: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}...]
94
+ Record B: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}...]
 
95
  Are record A and record B the same entity? Choose your answer from: [Yes, No]
96
  ```
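
The entity matching template above could be filled in programmatically. The following is a sketch only: `format_record` and `build_entity_matching_prompt` are hypothetical helpers, records are assumed to be plain dictionaries, and joining the template lines with single newlines and using unescaped quotes around "nan" are assumptions rather than the authors' exact formatting.

```python
def format_record(record: dict) -> str:
    """Render a record as [attribute: value, attribute: value]."""
    return "[" + ", ".join(f"{k}: {v}" for k, v in record.items()) + "]"


def build_entity_matching_prompt(record_a: dict, record_b: dict) -> str:
    """Fill the entity matching template with two records."""
    # Assumption: both records share the attribute names of record_a.
    attributes = ", ".join(record_a.keys())
    return "\n".join([
        "You are tasked with determining whether two records listed below "
        "are the same based on the information provided.",
        f"Carefully compare the {attributes} for each record before making your decision.",
        'Note: Missing values (N/A or "nan") should not be used as a basis for your decision.',
        f"Record A: {format_record(record_a)}",
        f"Record B: {format_record(record_b)}",
        "Are record A and record B the same entity? Choose your answer from: [Yes, No]",
    ])
```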
97
+ #### For Data Imputation
98
+ ```
99
+ You are presented with a {keyword} record that is missing a specific attribute: {attribute X}.
100
+ Your task is to deduce or infer the value of {attribute X} using the available information in the record.
101
+ You may be provided with fields like {attribute 1}, {attribute 2}, ... to help you in the inference.
102
+ Record: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
103
+ Based on the provided record, what would you infer is the value for the missing attribute {attribute X}?
104
+ Answer only the value of {attribute X}.
105
+ ```
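
As a rough illustration, the data imputation template could be instantiated as follows. `build_imputation_prompt` is a hypothetical helper, and the newline separators between template lines are an assumption.

```python
def build_imputation_prompt(keyword: str, record: dict, missing: str) -> str:
    """Fill the data imputation template.

    keyword: the kind of record (the {keyword} placeholder).
    record:  the known attribute/value pairs.
    missing: name of the attribute to infer (the {attribute X} placeholder).
    """
    names = ", ".join(record)
    rendered = ", ".join(f"{k}: {v}" for k, v in record.items())
    return "\n".join([
        f"You are presented with a {keyword} record that is missing a specific attribute: {missing}.",
        f"Your task is to deduce or infer the value of {missing} using the available information in the record.",
        f"You may be provided with fields like {names} to help you in the inference.",
        f"Record: [{rendered}]",
        f"Based on the provided record, what would you infer is the value for the missing attribute {missing}?",
        f"Answer only the value of {missing}.",
    ])
```

Unlike the Yes/No tasks, the expected response here is a free-form value, so a caller would typically keep the returned text as-is.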
106
+ #### For Error Detection
107
+ _There are two forms of the error detection task.
108
+ In the first form, a complete record row is provided, and the task is to determine if a specific value is erroneous.
109
+ In the second form, only the value of a specific attribute is given, and the decision about its correctness is based solely on the attribute's name and value.
110
+ The subsequent prompt examples pertain to these two forms, respectively._
111
+ ```
112
+ Your task is to determine if there is an error in the value of a specific attribute within the whole record provided.
113
+ The attributes may include {attribute 1}, {attribute 2}, ...
114
+ Errors may include, but are not limited to, spelling errors, inconsistencies, or values that don't make sense given the context of the whole record.
115
+ Record [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}...]
116
+ Attribute for Verification: [{attribute X}: {attribute X value}]
117
+ Question: Is there an error in the value of {attribute X}? Choose your answer from: [Yes, No]
118
+ ```
119
+ ```
120
+ Your task is to determine if there is an error in the value of a specific attribute.
121
+ The attributes may belong to a healthcare-related record and could be one of the following: {attribute 1}, {attribute 2}, ...
122
+ Errors can include, but are not limited to, spelling errors, inconsistencies, or values that don't make sense for that attribute.
123
+ Note: Missing values (N/A or \"nan\") are not considered errors.
124
+ Attribute for Verification: [{attribute X}: {attribute X value}]
125
+ Question: Is there an error in the value of {attribute X}? Choose your answer from: [Yes, No]
126
+ ```
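
The two forms of the error detection prompt differ mainly in whether the full record is included. One possible way to select between them is sketched below. `build_error_detection_prompt` is a hypothetical helper, and the `domain` parameter is an assumption that generalises the "healthcare-related" wording of the second template.

```python
from typing import Dict, List, Optional


def build_error_detection_prompt(
    attributes: List[str],
    target: str,
    value: str,
    record: Optional[Dict[str, str]] = None,
    domain: str = "healthcare-related",
) -> str:
    """Build form 1 (with the whole record) or form 2 (attribute only)."""
    names = ", ".join(attributes)
    question = (
        f"Question: Is there an error in the value of {target}? "
        "Choose your answer from: [Yes, No]"
    )
    verification = f"Attribute for Verification: [{target}: {value}]"
    if record is not None:
        # Form 1: the decision may use the context of the whole record.
        rendered = ", ".join(f"{k}: {v}" for k, v in record.items())
        lines = [
            "Your task is to determine if there is an error in the value of "
            "a specific attribute within the whole record provided.",
            f"The attributes may include {names}",
            "Errors may include, but are not limited to, spelling errors, inconsistencies, "
            "or values that don't make sense given the context of the whole record.",
            f"Record [{rendered}]",
        ]
    else:
        # Form 2: only the attribute name and value are available.
        lines = [
            "Your task is to determine if there is an error in the value of a specific attribute.",
            f"The attributes may belong to a {domain} record and could be one of the following: {names}",
            "Errors can include, but are not limited to, spelling errors, inconsistencies, "
            "or values that don't make sense for that attribute.",
            'Note: Missing values (N/A or "nan") are not considered errors.',
        ]
    return "\n".join(lines + [verification, question])
```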
127
+ #### For Schema Matching
128
+ ```
129
+ Your task is to determine if the two attributes (columns) are semantically equivalent in the context of merging two tables.
130
+ Each attribute will be provided by its name and a brief description.
131
+ Your goal is to assess if they refer to the same information based on these names and descriptions provided.
132
+ Attribute A is [name: {the value of name}, description: {the value of description}].
133
+ Attribute B is [name: {the value of name}, description: {the value of description}].
134
+ Are Attribute A and Attribute B semantically equivalent? Choose your answer from: [Yes, No]
135
+ ```
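
Since schema matching is framed as a pairwise decision, merging two tables would involve one prompt per candidate column pair. The sketch below assumes each attribute is a dictionary with `name` and `description` keys. Both function names are hypothetical, and enumerating every pair is an illustrative choice rather than the authors' stated procedure.

```python
from itertools import product
from typing import Dict, List


def build_schema_matching_prompt(attr_a: Dict[str, str], attr_b: Dict[str, str]) -> str:
    """Fill the schema matching template for one pair of attributes."""
    return "\n".join([
        "Your task is to determine if the two attributes (columns) are semantically "
        "equivalent in the context of merging two tables.",
        "Each attribute will be provided by its name and a brief description.",
        "Your goal is to assess if they refer to the same information based on "
        "these names and descriptions provided.",
        f"Attribute A is [name: {attr_a['name']}, description: {attr_a['description']}].",
        f"Attribute B is [name: {attr_b['name']}, description: {attr_b['description']}].",
        "Are Attribute A and Attribute B semantically equivalent? Choose your answer from: [Yes, No]",
    ])


def candidate_prompts(table_a: List[Dict[str, str]], table_b: List[Dict[str, str]]) -> List[str]:
    """One prompt per (column of table A, column of table B) pair."""
    return [build_schema_matching_prompt(a, b) for a, b in product(table_a, table_b)]
```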
136
 
137
  ### JellyFish-13B-reasoning
138
+ #### For Entity Matching
139
  ```
140
+ You are tasked with determining whether two products listed below are the same based on the information provided.
141
+ Carefully examine all the attributes before making your decision.
142
+ Note: Missing values (N/A or \"nan\") should not be used as a basis for your decision.
143
+ Record A: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}...]
144
+ Record B: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}...]
145
+ Are record A and record B the same entity?
146
+ After your reasoning, finish your response in a separate line with and ONLY with your final answer. Choose your final answer from [Yes, No].
147
+ ```
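
Because the reasoning prompt asks for the final answer on a separate last line, a response can be split into its explanation and its verdict. This is a minimal sketch under that assumption. `split_reasoning_and_answer` is a hypothetical helper, and real model output may not always follow the requested layout.

```python
from typing import Tuple


def split_reasoning_and_answer(response: str) -> Tuple[str, bool]:
    """Return (reasoning text, final answer as a boolean).

    Assumption: the last non-empty line holds only 'Yes' or 'No',
    optionally followed by a period.
    """
    lines = [line for line in response.strip().splitlines() if line.strip()]
    if not lines:
        raise ValueError("Empty response")
    final = lines[-1].strip().rstrip(".").lower()
    if final not in ("yes", "no"):
        raise ValueError(f"Last line is not a final answer: {lines[-1]!r}")
    reasoning = "\n".join(lines[:-1])
    return reasoning, final == "yes"
```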
148
+ #### For Data Imputation
149
 
150
+ #### For Error Detection
151
 
152
+ #### For Schema Matching
153
 
 
154
 
155
  <!--
156
  ## Bias, Risks, and Limitations