Note that Jellyfish is only a 13B model and can be run locally for low cost and data security.
|

| Error Detection | Adult | 94.40 | 92.01 | 92.01 | 96.62 | 90.13 |
| Schema Matching | Synthea | 38.50 | 57.14 | 66.67 | 36.36 | 30.77 |

_Accuracy is used as the metric for data imputation, and the F1 score for the other tasks._
_For GPT-3.5, we used the few-shot approach, while for Jellyfish and Jellyfish-Reasoning, the zero-shot approach was employed._

1. [Ditto](https://arxiv.org/abs/2004.00584) for Entity Matching; [SMAT](https://www.researchgate.net/publication/353920530_SMAT_An_Attention-Based_Deep_Learning_Solution_to_the_Automation_of_Schema_Matching) for Schema Matching; [HoloDetect](https://arxiv.org/abs/1904.02285) for Error Detection; [HoloClean](https://arxiv.org/abs/1702.00820) for Data Imputation
2. [Large Language Models as Data Preprocessors](https://arxiv.org/abs/2308.16361)

We release two versions of Jellyfish: Jellyfish-13B (the main branch) and Jellyfish-13B-Reasoning.
As the names suggest, Jellyfish-13B focuses on providing accurate, direct answers.
In contrast, Jellyfish-13B-Reasoning distills knowledge from GPT-4: it is fine-tuned with data containing reasoning and chain-of-thought responses, generated by GPT-4, for solving data preprocessing tasks.

The two versions are designed for different application scenarios.
Jellyfish-13B is suitable for integration into larger data management systems due to its simple and clear responses, which can be easily transformed into code.
On the other hand, Jellyfish-13B-Reasoning is more user-oriented, with responses that offer deeper insights into the data.
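
To illustrate why direct answers are easy to consume programmatically, here is a minimal Python sketch (not part of this card; the response strings are hypothetical) that maps a Jellyfish-13B reply onto a boolean:

```python
def to_bool(response: str) -> bool:
    # Jellyfish-13B answers matching tasks with a plain "Yes" or "No",
    # so the reply maps directly onto a boolean for downstream code.
    return response.strip().lower().startswith("yes")

# Hypothetical replies, for illustration only.
assert to_bool("Yes") is True
assert to_bool("No") is False
```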

**The Jellyfish paper is coming soon!**

- **Developed by:** Haochen Zhang, Yuyang Dong, Chuan Xiao, Masafumi Oyamada

We used LoRA to speed up the training process, targeting the q_proj and v_proj modules.

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
Here are the prompts we used for both fine-tuning the model and for inference. Feel free to explore different prompts on your own to achieve the best generation quality.
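
For reference, here is a minimal inference sketch using the Hugging Face `transformers` library. The repository ID and generation settings are assumptions for illustration; substitute the actual model repository and one of the task prompts below:

```python
# Minimal inference sketch, assuming the model is hosted on the
# Hugging Face Hub under "NECOUDBFM/Jellyfish" (verify the actual repo ID).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "NECOUDBFM/Jellyfish"  # assumed repository ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Any of the task prompts below goes here, with its placeholders filled in.
prompt = "..."

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)

# Decode only the newly generated tokens, not the echoed prompt.
answer = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(answer)
```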

### Jellyfish-13B

#### For Entity Matching
```
You are tasked with determining whether two records listed below are the same based on the information provided.
Carefully compare the {attribute 1}, {attribute 2}... for each record before making your decision.
Note: Missing values (N/A or \"nan\") should not be used as a basis for your decision.
Record A: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}...]
Record B: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}...]
Are record A and record B the same entity? Choose your answer from: [Yes, No]
```
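
The `{attribute ...}` placeholders are filled in per dataset before the prompt is sent to the model. Below is a small, illustrative Python sketch of that substitution; the field names and record values are invented for the example, not taken from the training data:

```python
# Illustrative filling of the entity-matching template above.
# Note: the \" escapes render as plain quotes in the runtime string.
TEMPLATE = """You are tasked with determining whether two records listed below are the same based on the information provided.
Carefully compare the {attrs} for each record before making your decision.
Note: Missing values (N/A or \"nan\") should not be used as a basis for your decision.
Record A: [{record_a}]
Record B: [{record_b}]
Are record A and record B the same entity? Choose your answer from: [Yes, No]"""

def serialize(record: dict) -> str:
    # Render a record as "attribute: value, attribute: value, ...".
    return ", ".join(f"{k}: {v}" for k, v in record.items())

# Made-up product records, for illustration only.
record_a = {"title": "instant immersion spanish deluxe 2.0",
            "manufacturer": "topics entertainment", "price": "49.99"}
record_b = {"title": "instant immers spanish dlux 2",
            "manufacturer": "nan", "price": "36.11"}

prompt = TEMPLATE.format(attrs="title, manufacturer, price",
                         record_a=serialize(record_a),
                         record_b=serialize(record_b))
print(prompt)
```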

#### For Data Imputation
```
You are presented with a {keyword} record that is missing a specific attribute: {attribute X}.
Your task is to deduce or infer the value of {attribute X} using the available information in the record.
You may be provided with fields like {attribute 1}, {attribute 2}, ... to help you in the inference.
Record: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
Based on the provided record, what would you infer is the value for the missing attribute {attribute X}?
Answer only the value of {attribute X}.
```

#### For Error Detection

_There are two forms of the error detection task.
In the first form, a complete record row is provided, and the task is to determine if a specific value is erroneous.
In the second form, only the value of a specific attribute is given, and the decision about its correctness is based solely on the attribute's name and value.
The subsequent prompt examples pertain to these two forms, respectively._

```
Your task is to determine if there is an error in the value of a specific attribute within the whole record provided.
The attributes may include {attribute 1}, {attribute 2}, ...
Errors may include, but are not limited to, spelling errors, inconsistencies, or values that don't make sense given the context of the whole record.
Record: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}...]
Attribute for Verification: [{attribute X}: {attribute X value}]
Question: Is there an error in the value of {attribute X}? Choose your answer from: [Yes, No]
```
```
Your task is to determine if there is an error in the value of a specific attribute.
The attributes may belong to a healthcare-related record and could be one of the following: {attribute 1}, {attribute 2}, ...
Errors can include, but are not limited to, spelling errors, inconsistencies, or values that don't make sense for that attribute.
Note: Missing values (N/A or \"nan\") are not considered errors.
Attribute for Verification: [{attribute X}: {attribute X value}]
Question: Is there an error in the value of {attribute X}? Choose your answer from: [Yes, No]
```

#### For Schema Matching
```
Your task is to determine if the two attributes (columns) are semantically equivalent in the context of merging two tables.
Each attribute will be provided by its name and a brief description.
Your goal is to assess if they refer to the same information based on the names and descriptions provided.
Attribute A is [name: {the value of name}, description: {the value of description}].
Attribute B is [name: {the value of name}, description: {the value of description}].
Are Attribute A and Attribute B semantically equivalent? Choose your answer from: [Yes, No]
```

### Jellyfish-13B-Reasoning

#### For Entity Matching
```
You are tasked with determining whether two products listed below are the same based on the information provided.
Carefully examine all the attributes before making your decision.
Note: Missing values (N/A or \"nan\") should not be used as a basis for your decision.
Record A: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}...]
Record B: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}...]
Are record A and record B the same entity?
After your reasoning, finish your response in a separate line with and ONLY with your final answer. Choose your final answer from [Yes, No].
```
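
Because the reasoning variant is instructed to finish with the final answer alone on its last line, a caller can recover the label with a simple split. A minimal sketch (the reply text is hypothetical):

```python
def final_answer(response: str) -> str:
    # The prompt asks the model to put its final answer alone on the
    # last line, so take the last non-empty line of the response.
    lines = [line.strip() for line in response.splitlines() if line.strip()]
    return lines[-1]

# Hypothetical chain-of-thought reply, for illustration only.
reply = ("The two records share the same manufacturer and nearly identical "
         "titles, and the price difference is small.\n"
         "Yes")
assert final_answer(reply) == "Yes"
```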

#### For Data Imputation

#### For Error Detection

#### For Schema Matching

<!--
## Bias, Risks, and Limitations