feras-vbrl committed on
Commit d876e1d · verified · 1 Parent(s): 97e81e6

Upload 7 files
Files changed (4)

1. DEPLOY.md +115 -0
2. README.md +73 -0
3. app.py +413 -0
4. requirements.txt +11 -0
DEPLOY.md ADDED
@@ -0,0 +1,115 @@
# Deploying SigmaTriple to Hugging Face Spaces

This guide will help you deploy the SigmaTriple application to Hugging Face Spaces.

## Prerequisites

1. A Hugging Face account (sign up at [huggingface.co](https://huggingface.co/join))
2. Git installed on your local machine
3. Hugging Face CLI (optional, for command-line deployment)

## Deployment Steps

### Option 1: Using the Hugging Face Web Interface

1. **Create a New Space**:
   - Go to [huggingface.co/spaces](https://huggingface.co/spaces)
   - Click "Create new Space"
   - Enter a name for your Space (e.g., "SigmaTriple")
   - Select "Streamlit" as the SDK
   - Choose "Public" or "Private" visibility
   - Select "T4" as the hardware (a GPU is recommended for this application)
   - Click "Create Space"

2. **Upload Files**:
   - You can upload the files directly through the web interface
   - Or clone the Space repository and push the files using Git (recommended)

3. **Git Deployment**:
   ```bash
   # Clone your new Space repository
   git clone https://huggingface.co/spaces/YOUR_USERNAME/SigmaTriple

   # Copy all files from this project to the cloned repository
   cp -r * /path/to/cloned/repo/
   cp -r .streamlit /path/to/cloned/repo/
   cp .gitignore /path/to/cloned/repo/

   # Navigate to the cloned repository
   cd /path/to/cloned/repo

   # Stage all files
   git add .

   # Commit the changes
   git commit -m "Initial commit of SigmaTriple application"

   # Push to Hugging Face Spaces
   git push
   ```

4. **Wait for Deployment**:
   - Hugging Face will automatically build and deploy your Space
   - This may take a few minutes, especially for the first deployment
   - You can monitor the build process in the "Logs" tab of your Space

### Option 2: Using the Hugging Face CLI

1. **Install the Hugging Face CLI**:
   ```bash
   pip install huggingface_hub
   ```

2. **Log in to Hugging Face**:
   ```bash
   huggingface-cli login
   ```

3. **Create a New Space**:
   ```bash
   huggingface-cli repo create SigmaTriple --type space --space_sdk streamlit
   ```

4. **Clone and Push**:
   ```bash
   git clone https://huggingface.co/spaces/YOUR_USERNAME/SigmaTriple
   cp -r * /path/to/cloned/repo/
   cp -r .streamlit /path/to/cloned/repo/
   cp .gitignore /path/to/cloned/repo/
   cd /path/to/cloned/repo
   git add .
   git commit -m "Initial commit of SigmaTriple application"
   git push
   ```
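
If you prefer to avoid Git entirely, the `huggingface_hub` Python library can also push a folder to an existing Space. This is a minimal sketch, not part of the uploaded project; it assumes you have already run `huggingface-cli login` and created the Space:

```python
# Sketch: upload the project folder to an existing Space without Git.
# Assumes huggingface_hub is installed and you are already logged in.
from huggingface_hub import HfApi

api = HfApi()
api.upload_folder(
    folder_path=".",                      # local project directory
    repo_id="YOUR_USERNAME/SigmaTriple",  # your Space
    repo_type="space",
)
```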
## Configuration Options

You can customize your Space by modifying the following files:

- `.streamlit/config.toml`: Streamlit configuration (see the example below)
- `README.md`: Documentation and Space description
- `requirements.txt`: Python dependencies
- `packages.txt`: System dependencies
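
For example, a minimal `.streamlit/config.toml` might look like the sketch below. This file is not part of the commit shown here, and every value is an assumption to adapt to your Space:

```toml
# Hypothetical minimal Streamlit configuration for a Space
[server]
headless = true            # no local browser on the server

[browser]
gatherUsageStats = false   # disable telemetry

[theme]
base = "light"
```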
## Troubleshooting

If you encounter any issues during deployment:

1. **Check the Build Logs**:
   - Go to the "Logs" tab of your Space
   - Look for any error messages in the build logs

2. **Common Issues**:
   - **Memory Errors**: The model requires significant memory. Make sure you're using a GPU instance.
   - **Dependency Issues**: Check that all required packages are listed in `requirements.txt` and `packages.txt`.
   - **Timeout Errors**: The initial model loading can take a long time, and Spaces enforces build and startup time limits, so very slow startups may fail.

3. **Reduce Model Size**:
   - If you're experiencing memory issues, you can modify app.py to use a smaller model or implement model loading optimizations, as in the sketch below.
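
For example, one way to trade context length for memory is to shrink the vllm settings that `load_model()` in app.py passes to `LLM`. The reduced values here are illustrative, not tested recommendations:

```python
# Sketch: more memory-conservative vllm settings for load_model() in app.py.
from vllm import LLM

model = LLM(
    model="sciphi/triplex",
    trust_remote_code=True,
    tensor_parallel_size=1,
    gpu_memory_utilization=0.7,  # leave more headroom than the app's 0.9
    max_model_len=4096,          # halve the context window to shrink the KV cache
)
```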
## Accessing Your Space

Once deployed, your Space will be available at:
`https://huggingface.co/spaces/YOUR_USERNAME/SigmaTriple`

You can share this URL with others to let them use your application.
README.md ADDED
@@ -0,0 +1,73 @@
---
title: SigmaTriple
emoji: 🔍
colorFrom: blue
colorTo: indigo
sdk: streamlit
sdk_version: "1.32.0"
app_file: app.py
pinned: false
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# SigmaTriple: Knowledge Graph Extraction from Markdown

This Hugging Face Space provides a Streamlit interface for extracting knowledge graphs from markdown text using the [SciPhi/Triplex](https://huggingface.co/sciphi/triplex) model.

## Features

- **Extract Knowledge Graphs**: Automatically identify entities and relationships from markdown text
- **Customizable Entity Types and Predicates**: Define the types of entities and relationships you want to extract
- **Batch Processing**: Process large markdown files efficiently using vllm
- **Interactive Visualization**: View the extracted knowledge graph as an interactive network diagram
- **File Upload Support**: Upload markdown files directly or input text manually

## How It Works

1. The application uses the SciPhi/Triplex model, which is fine-tuned for knowledge graph extraction
2. Markdown text is processed to extract plain text content
3. For large texts, batch processing is applied with overlapping chunks so that context is maintained across chunk boundaries
4. The model identifies entities and relationships based on the specified entity types and predicates
5. Results are parsed and visualized as an interactive knowledge graph
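
As an illustration of step 5, `parse_triplets` in app.py reduces the model output to a structure like the following (the entities here are invented for the example):

```python
# Hypothetical parsed result: app.py renders each triplet as a table row
# and as a labeled edge in the interactive graph.
parsed = {
    "triplets": [
        {"subject": "Ada Lovelace", "predicate": "WORKS_AT", "object": "Analytical Engine Project"},
        {"subject": "Analytical Engine Project", "predicate": "LOCATED_IN", "object": "London"},
    ]
}
```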
## Usage

1. **Configure Entity Types and Predicates**:
   - In the sidebar, customize the entity types (e.g., PERSON, ORGANIZATION) and predicates (e.g., WORKS_AT, FOUNDED) you want to extract

2. **Input Text**:
   - Choose between direct text input or file upload
   - For text input, simply paste your markdown text in the provided area
   - For file upload, select a markdown (.md or .markdown) or text (.txt) file

3. **Extract Knowledge Graph**:
   - Click the "Extract Knowledge Graph" button to process the text
   - View the raw model output, the extracted triplets table, and the interactive visualization

## Technical Details

- Uses the SciPhi/Triplex model for knowledge graph extraction
- Uses vllm for efficient batch processing when available
- Falls back to the standard transformers library if vllm is not available
- Visualizes knowledge graphs using NetworkX and PyVis

## Example Use Cases

- **Research Papers**: Extract key concepts and relationships from academic papers
- **Documentation**: Create knowledge graphs from technical documentation
- **Content Analysis**: Identify key entities and relationships in articles or blog posts
- **Educational Content**: Visualize relationships between concepts in educational materials

## Limitations

- The quality of extraction depends on the clarity and structure of the input text
- Very large documents may require significant processing time
- The model may not capture all relationships, especially those requiring deep contextual understanding

## Credits

- [SciPhi/Triplex Model](https://huggingface.co/sciphi/triplex)
- [vllm](https://github.com/vllm-project/vllm) for efficient batch processing
- [Streamlit](https://streamlit.io/) for the web interface
- [NetworkX](https://networkx.org/) and [PyVis](https://pyvis.readthedocs.io/) for graph visualization
app.py ADDED
@@ -0,0 +1,413 @@
import streamlit as st
import streamlit.components.v1 as components
import json
import torch
import os
import tempfile
import networkx as nx
from pyvis.network import Network
import markdown
import time
from bs4 import BeautifulSoup
from transformers import AutoModelForCausalLM, AutoTokenizer

# Try to import vllm, but don't fail if it's not available
try:
    from vllm import LLM, SamplingParams
    VLLM_AVAILABLE = True
except ImportError:
    VLLM_AVAILABLE = False

# Set page configuration
st.set_page_config(
    page_title="SigmaTriple - Knowledge Graph Extractor",
    page_icon="🔍",
    layout="wide"
)

# Cache the model loading to avoid reloading on each interaction
@st.cache_resource
def load_model():
    with st.spinner("Loading model..."):
        # Check whether a GPU is available
        gpu_available = torch.cuda.is_available()
        st.info(f"GPU available: {gpu_available}")

        # Preferred path: vllm on a GPU (settings tuned for a single T4)
        if gpu_available and VLLM_AVAILABLE:
            try:
                model = LLM(
                    model="sciphi/triplex",
                    trust_remote_code=True,
                    tensor_parallel_size=1,
                    gpu_memory_utilization=0.9,  # high utilization is fine on a dedicated T4
                    max_model_len=8192,  # increased context length
                )
                tokenizer = AutoTokenizer.from_pretrained("sciphi/triplex", trust_remote_code=True)
                st.success("✅ Successfully loaded model with vllm on GPU")
                return model, tokenizer, True  # True indicates vllm is used
            except Exception as e:
                st.warning(f"Failed to load model with vllm: {e}. Falling back to standard transformers.")
        elif not VLLM_AVAILABLE:
            st.warning("vllm is not available. Using standard transformers.")
        else:
            st.warning("No GPU available. vllm requires a GPU. Using standard transformers.")

        # Fallback to standard transformers
        device = "cuda" if gpu_available else "cpu"
        st.info(f"Loading model on {device} using standard transformers.")

        if device == "cuda":
            # Let accelerate place the model and use half precision on GPU
            model = AutoModelForCausalLM.from_pretrained(
                "sciphi/triplex",
                trust_remote_code=True,
                device_map="auto",
                torch_dtype=torch.float16
            )
        else:
            # CPU fallback: try 8-bit quantization first. bitsandbytes needs a
            # CUDA GPU, so on a pure-CPU machine this usually falls through to
            # the standard full-precision load below.
            try:
                from transformers import BitsAndBytesConfig
                quantization_config = BitsAndBytesConfig(load_in_8bit=True)
                model = AutoModelForCausalLM.from_pretrained(
                    "sciphi/triplex",
                    trust_remote_code=True,
                    quantization_config=quantization_config
                )
            except Exception as e:
                st.warning(f"Failed to load 8-bit model: {e}. Using standard model.")
                # Without a device_map, from_pretrained loads on CPU by default
                model = AutoModelForCausalLM.from_pretrained(
                    "sciphi/triplex",
                    trust_remote_code=True
                )

        tokenizer = AutoTokenizer.from_pretrained("sciphi/triplex", trust_remote_code=True)
        return model, tokenizer, False  # False indicates standard transformers is used

def triplextract(model, tokenizer, text, entity_types, predicates, use_vllm=True):
    input_format = """Perform Named Entity Recognition (NER) and extract knowledge graph triplets from the text. NER identifies named entities of given entity types, and triple extraction identifies relationships between entities using specified predicates.

**Entity Types:**
{entity_types}

**Predicates:**
{predicates}

**Text:**
{text}
"""

    message = input_format.format(
        entity_types=json.dumps({"entity_types": entity_types}),
        predicates=json.dumps({"predicates": predicates}),
        text=text)

    start_time = time.time()

    if use_vllm and VLLM_AVAILABLE:
        # Use vllm for inference
        sampling_params = SamplingParams(
            temperature=0.0,
            max_tokens=2048,
        )
        outputs = model.generate([message], sampling_params)
        output = outputs[0].outputs[0].text
    else:
        # Use standard transformers
        messages = [{'role': 'user', 'content': message}]
        device = next(model.parameters()).device  # the device the model is on
        input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(device)
        # max_new_tokens bounds the generated tokens regardless of prompt length
        # (max_length would count the prompt toward the limit)
        output = tokenizer.decode(model.generate(input_ids=input_ids, max_new_tokens=2048)[0], skip_special_tokens=True)

    processing_time = time.time() - start_time
    st.info(f"Processing time: {processing_time:.2f} seconds")

    return output

def batch_process_markdown(model, tokenizer, markdown_text, entity_types, predicates, use_vllm=True, chunk_size=1000, overlap=100):
    """Process large markdown text in overlapping chunks."""
    # Convert markdown to plain text
    html = markdown.markdown(markdown_text)
    text = BeautifulSoup(html, features="html.parser").get_text()

    # Split the text into chunks that overlap so context spans chunk boundaries
    chunks = []
    for i in range(0, len(text), chunk_size - overlap):
        chunks.append(text[i:i + chunk_size])

    # If there are many chunks, inform the user
    if len(chunks) > 20:
        st.info(f"📊 Your text will be processed in {len(chunks)} chunks.")

    # Process each chunk with progress indicators
    all_results = []
    progress_bar = st.progress(0)
    status_text = st.empty()
    time_estimate = st.empty()

    start_time = time.time()

    for i, chunk in enumerate(chunks):
        # Update progress
        progress = (i + 1) / len(chunks)
        progress_bar.progress(progress)
        status_text.text(f"Processing chunk {i+1}/{len(chunks)} ({int(progress*100)}%)")

        try:
            with st.spinner(f"Processing chunk {i+1}/{len(chunks)}..."):
                chunk_start_time = time.time()
                result = triplextract(model, tokenizer, chunk, entity_types, predicates, use_vllm)
                chunk_time = time.time() - chunk_start_time

                # After the first chunk, estimate the total processing time
                if i == 0:
                    estimated_total_time = chunk_time * len(chunks)
                    time_estimate.info(f"⏱️ Estimated total processing time: {estimated_total_time:.1f} seconds ({estimated_total_time/60:.1f} minutes)")

                all_results.append(result)

                # Show the time taken for this chunk
                st.success(f"✅ Chunk {i+1}/{len(chunks)} processed in {chunk_time:.1f} seconds")
        except Exception as e:
            st.error(f"Error processing chunk {i+1}: {e}")
            all_results.append(f"Error processing this chunk: {e}")

    # Show the total time taken
    total_time = time.time() - start_time
    st.info(f"Total processing time: {total_time:.1f} seconds ({total_time/60:.1f} minutes)")

    # Clear the progress indicators
    progress_bar.empty()
    status_text.empty()
    time_estimate.empty()

    # Combine the per-chunk results
    return "\n\n".join(all_results)

def parse_triplets(output):
    """Parse the model output to extract triplets."""
    try:
        # Find the JSON part of the output
        start_idx = output.find('{')
        end_idx = output.rfind('}')

        if start_idx != -1 and end_idx != -1:
            json_str = output[start_idx:end_idx + 1]
            return json.loads(json_str)

        # If no JSON was found, try to parse a "subject -> predicate <- object" text format
        triplets = []
        for line in output.split('\n'):
            if '->' in line and '<-' in line:
                parts = line.split('->')
                if len(parts) >= 2:
                    subject = parts[0].strip()
                    rest = parts[1].split('<-')
                    if len(rest) >= 2:
                        triplets.append({
                            "subject": subject,
                            "predicate": rest[0].strip(),
                            "object": rest[1].strip()
                        })

        # Return whatever was found (possibly an empty list)
        return {"triplets": triplets}
    except Exception as e:
        st.error(f"Error parsing triplets: {e}")
        return {"triplets": []}

def visualize_knowledge_graph(triplets):
    """Create a network visualization of the knowledge graph."""
    G = nx.DiGraph()

    # Add nodes and edges for each triplet
    for triplet in triplets:
        subject = triplet.get("subject", "")
        predicate = triplet.get("predicate", "")
        object_ = triplet.get("object", "")

        if subject and object_:
            G.add_node(subject)
            G.add_node(object_)
            G.add_edge(subject, object_, title=predicate, label=predicate)

    # Create the pyvis network
    net = Network(notebook=True, height="600px", width="100%", directed=True)

    # Add nodes
    for node in G.nodes():
        net.add_node(node, label=node, title=node)

    # Add edges, labeling each with its predicate
    for source, target, data in G.edges(data=True):
        net.add_edge(source, target, title=data.get('title', ''), label=data.get('label', ''))

    # Write the graph to a temporary HTML file and return its path
    with tempfile.NamedTemporaryFile(delete=False, suffix='.html') as tmp:
        net.save_graph(tmp.name)
        return tmp.name

def display_results(result):
    """Show the raw model output, a triplet table, and the graph visualization."""
    # Display the raw output in an expandable section
    with st.expander("Raw Model Output"):
        st.text(result)

    # Parse the triplets
    parsed_data = parse_triplets(result)
    triplets = parsed_data.get("triplets", [])

    if not triplets:
        st.warning("No triplets were extracted from the input.")
        return

    st.subheader(f"Extracted {len(triplets)} Knowledge Graph Triplets:")

    # Display the triplets in a table
    triplet_data = [{
        "Subject": t.get("subject", ""),
        "Predicate": t.get("predicate", ""),
        "Object": t.get("object", "")
    } for t in triplets]
    st.table(triplet_data)

    # Visualize the knowledge graph
    html_file = visualize_knowledge_graph(triplets)
    st.subheader("Knowledge Graph Visualization:")
    with open(html_file, 'r') as f:
        components.html(f.read(), height=600)
    os.unlink(html_file)  # Clean up the temporary file

def main():
    st.title("🔍 SigmaTriple - Knowledge Graph Extractor")
    st.markdown("""
    Extract knowledge graphs from markdown text using the SciPhi/Triplex model.
    """)

    # Load the model (the spinner lives inside load_model)
    model, tokenizer, use_vllm = load_model()

    # Add a note about performance
    if use_vllm:
        st.success("🚀 Running on GPU with vllm for optimal performance!")
    elif torch.cuda.is_available():
        st.success("🚀 Running on GPU with standard transformers.")
    else:
        st.warning(
            "⚠️ You are running on CPU, which can be very slow for the SciPhi/Triplex model. "
            "Processing may take 10+ minutes for even small texts."
        )

    # Sidebar for configuration
    st.sidebar.title("Configuration")

    # Entity types and predicates input
    st.sidebar.subheader("Entity Types")
    entity_types_default = ["PERSON", "ORGANIZATION", "LOCATION", "DATE", "EVENT", "PRODUCT", "TECHNOLOGY"]
    entity_types_input = st.sidebar.text_area("Enter entity types (one per line)",
                                              "\n".join(entity_types_default),
                                              height=150)
    entity_types = [et.strip() for et in entity_types_input.split("\n") if et.strip()]

    st.sidebar.subheader("Predicates")
    predicates_default = ["WORKS_AT", "LOCATED_IN", "FOUNDED", "DEVELOPED", "USES", "RELATED_TO", "PART_OF", "CREATED", "MEMBER_OF"]
    predicates_input = st.sidebar.text_area("Enter predicates (one per line)",
                                            "\n".join(predicates_default),
                                            height=150)
    predicates = [p.strip() for p in predicates_input.split("\n") if p.strip()]

    # Option to adjust the chunk size
    st.sidebar.subheader("Performance Settings")
    chunk_size = st.sidebar.slider("Chunk Size", 500, 2000, 1000,
                                   help="Larger chunks capture more context but may take longer to process")

    # Input method selection
    input_method = st.radio("Select input method:", ["Text Input", "File Upload"])

    if input_method == "Text Input":
        markdown_text = st.text_area("Enter markdown text:", height=300)
        process_button = st.button("Extract Knowledge Graph")

        if process_button and markdown_text:
            with st.spinner("Processing text..."):
                result = batch_process_markdown(model, tokenizer, markdown_text, entity_types, predicates, use_vllm, chunk_size=chunk_size)
            display_results(result)

    else:  # File Upload
        uploaded_file = st.file_uploader("Upload a markdown file", type=["md", "markdown", "txt"])

        if uploaded_file is not None:
            markdown_text = uploaded_file.read().decode("utf-8")
            st.subheader("File Preview:")
            with st.expander("Show file content"):
                st.markdown(markdown_text)

            process_button = st.button("Extract Knowledge Graph")

            if process_button:
                with st.spinner("Processing file..."):
                    result = batch_process_markdown(model, tokenizer, markdown_text, entity_types, predicates, use_vllm, chunk_size=chunk_size)
                display_results(result)

    # Information about the model
    st.sidebar.markdown("---")
    st.sidebar.subheader("About")
    st.sidebar.info("""
    This app uses the SciPhi/Triplex model to extract knowledge graphs from text.

    The model performs Named Entity Recognition (NER) and extracts relationships between entities.

    Using vllm: {}
    """.format("Yes" if use_vllm else "No (using standard transformers)"))

if __name__ == "__main__":
    main()
requirements.txt ADDED
@@ -0,0 +1,11 @@
streamlit==1.32.0
transformers==4.38.2
torch==2.1.2
vllm==0.3.0
accelerate==0.27.2
bitsandbytes==0.41.1
markdown==3.5.2
pydantic==2.5.2
networkx==3.2.1
pyvis==0.3.2
beautifulsoup4==4.12.2