GraphDB Setup & Integration

Setting up GraphDB for ontology storage and SPARQL queries

GraphDB is a powerful semantic database that serves as the backbone for storing and querying ontologies at scale. This guide covers setup, configuration, and integration with our ontology workflow.

What is GraphDB?

GraphDB is an enterprise-ready semantic graph database that:

  • Stores RDF triples efficiently at massive scale
  • Supports SPARQL 1.1 queries and updates
  • Performs forward-chaining reasoning with configurable rulesets (RDFS, OWL-Horst, OWL2-RL)
  • Provides REST APIs for programmatic access (sketched below)
  • Integrates with Protégé and other tools
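
Because GraphDB implements the RDF4J REST protocol, these capabilities are easy to verify from code. A minimal sketch in Python (it assumes a repository named plant-ontology already exists; repository setup is covered below):

import requests

BASE = "http://localhost:7200"

# Confirm the server is reachable via the RDF4J protocol endpoint
print(requests.get(f"{BASE}/protocol").text)

# Count all triples in a repository with a one-line SPARQL query
resp = requests.get(
    f"{BASE}/repositories/plant-ontology",
    params={"query": "SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }"},
    headers={"Accept": "application/sparql-results+json"},
)
print(resp.json()["results"]["bindings"][0]["n"]["value"], "triples")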

Why GraphDB for Our Project?

GraphDB serves as the central semantic hub that connects all components of our ontology-driven AI system:

flowchart LR
    A[📝 Protégé Desktop] --> B[🗄️ GraphDB]
    D[🐍 Python Scripts] --> B
    B --> E[🤖 LLM Applications]
    B --> F[🔀 MOE Systems]
    B --> G[🔗 SPARQL Endpoints]
    
    classDef input fill:#e3f2fd,stroke:#1976d2,stroke-width:2px,color:#1565c0
    classDef database fill:#fff3e0,stroke:#f57c00,stroke-width:3px,color:#ef6c00
    classDef output fill:#f1f8e9,stroke:#388e3c,stroke-width:2px,color:#2e7d32
    
    class A,D input
    class B database
    class E,F,G output

Note

GraphDB Integration Benefits:

  • Centralized Knowledge: Single source of truth for ontological data
  • SPARQL Interface: Standard query language for semantic data
  • Reasoning Support: Automatic inference and consistency checking
  • Scalability: Handles large-scale ontological datasets efficiently

Docker Setup

1. Launch GraphDB

Using the provided Docker configuration:

# Navigate to project directory
cd docker/

# Start GraphDB service
docker-compose -f docker-compose-graphdb.yml up -d

# Check if running
docker-compose -f docker-compose-graphdb.yml ps
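
GraphDB can take a little while to initialize inside the container. If you script against it, a small readiness poll avoids racing the startup; here is a sketch that polls the Workbench's repository-list endpoint:

import time
import requests

def wait_for_graphdb(base_url="http://localhost:7200", timeout=120):
    """Poll the REST API until GraphDB responds or the timeout expires."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            # /rest/repositories answers once the Workbench is up
            if requests.get(f"{base_url}/rest/repositories", timeout=5).ok:
                return True
        except requests.RequestException:
            pass
        time.sleep(2)
    return False

if __name__ == "__main__":
    print("GraphDB ready" if wait_for_graphdb() else "GraphDB did not start in time")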

2. Access GraphDB Workbench

Open your browser and navigate to:

http://localhost:7200

Security is disabled by default, so no login is required on first access. If you enable security, the default administrator account is:

  • Username: admin
  • Password: root

3. Docker Configuration Details

# docker-compose-graphdb.yml
version: '3.8'
services:
  graphdb:
    image: ontotext/graphdb:10.0.0
    ports:
      - "7200:7200"
    volumes:
      - graphdb-data:/opt/graphdb/home
    environment:
      - GDB_JAVA_OPTS=-Xmx2g
volumes:
  graphdb-data:

Repository Setup

1. Create a New Repository

  1. Access Workbench: Go to http://localhost:7200

  2. Setup Repositories: Click “Setup” → “Repositories”

  3. Create Repository: Click “Create new repository”

  4. Repository Type: Select “GraphDB Repository”

  5. Configuration:

    Repository ID: plant-ontology
    Repository title: Plant Disease Ontology Repository
    Storage folder: (leave default)
    Base URL: http://example.org/plants/
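
This step can also be scripted: the Workbench REST API accepts a repository configuration file as multipart form data. A sketch — the config below is modeled on GraphDB 10's repository template and may need adjusting for your version:

import requests

# Repository config TTL (template-based; verify parameter names
# against the templates shipped with your GraphDB version)
CONFIG_TTL = """
@prefix rep: <http://www.openrdf.org/config/repository#> .
@prefix sr: <http://www.openrdf.org/config/repository/sail#> .
@prefix sail: <http://www.openrdf.org/config/sail#> .
@prefix graphdb: <http://www.ontotext.com/config/graphdb#> .

[] a rep:Repository ;
    rep:repositoryID "plant-ontology" ;
    rep:repositoryImpl [
        rep:repositoryType "graphdb:SailRepository" ;
        sr:sailImpl [
            sail:sailType "graphdb:Sail" ;
            graphdb:ruleset "owl-horst-optimized" ;
            graphdb:base-URL "http://example.org/plants/"
        ]
    ] .
"""

response = requests.post(
    "http://localhost:7200/rest/repositories",
    files={"config": ("plant-ontology.ttl", CONFIG_TTL, "text/turtle")},
)
print(response.status_code, response.text)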

2. Repository Settings

Advanced Settings for optimal performance:

# Reasoning
Ruleset: OWL-Horst (optimized)

# Query timeout
Query timeout: 60 seconds

# Index settings
Entity index size: 10000000
Enable context index: true (useful if you query named graphs frequently)

Importing Ontologies

Method 1: Web Interface

  1. Navigate to repository: Select “plant-ontology”

  2. Import data: Go to “Import” → “RDF”

  3. Upload files:

    - pizza.owl
    - PizzaTutorial.rdf  
    - your_custom_ontology.owl
  4. Import settings:

    • Base URI: Keep default or set custom
    • Named graphs: Optional grouping
    • Processing: Enable reasoning

Method 2: Programmatic Import

import requests
import os

# GraphDB connection details
GRAPHDB_URL = "http://localhost:7200"
REPOSITORY = "plant-ontology"

def upload_ontology(file_path, context_uri=None):
    """Upload an ontology file to GraphDB"""
    
    # Prepare upload URL
    upload_url = f"{GRAPHDB_URL}/repositories/{REPOSITORY}/statements"
    
    # Headers for RDF data
    headers = {
        'Content-Type': 'application/rdf+xml'
    }
    
    # Add context (named graph) if specified
    params = {}
    if context_uri:
        params['context'] = f"<{context_uri}>"
    
    # Read and upload file
    with open(file_path, 'rb') as f:
        response = requests.post(
            upload_url,
            headers=headers,
            params=params,
            data=f.read()
        )
    
    if response.status_code == 204:
        print(f"✅ Successfully uploaded {file_path}")
    else:
        print(f"❌ Failed to upload {file_path}: {response.text}")

# Example usage
upload_ontology("ontologies/pizza.owl", "http://pizza.org")
upload_ontology("ontologies/plant_disease.owl", "http://plants.org")

Method 3: SPARQL Update

# Load ontology via SPARQL
LOAD <file:///path/to/ontology.owl> INTO GRAPH <http://example.org/plants>

# Or from URL
LOAD <https://example.org/remote_ontology.owl> INTO GRAPH <http://example.org/remote>
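
The same commands can be issued programmatically by POSTing a SPARQL update to the repository's /statements endpoint:

import requests

# Run a LOAD command as a SPARQL update (RDF4J protocol)
update = "LOAD <https://example.org/remote_ontology.owl> INTO GRAPH <http://example.org/remote>"

response = requests.post(
    "http://localhost:7200/repositories/plant-ontology/statements",
    headers={"Content-Type": "application/sparql-update"},
    data=update,
)
print("✅ Loaded" if response.status_code == 204 else f"❌ {response.text}")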

SPARQL Queries

Basic Queries

# List all classes
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?class ?label WHERE {
    ?class rdf:type rdfs:Class .
    OPTIONAL { ?class rdfs:label ?label }
}
LIMIT 50
# Find plant diseases
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX plant: <http://example.org/plants/>
PREFIX disease: <http://example.org/diseases/>

SELECT ?plant ?disease WHERE {
    ?plant plant:hasDisease ?disease .
    ?plant rdf:type plant:Crop .
}

Advanced Reasoning Queries

# Infer treatment recommendations
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX plant: <http://example.org/plants/>
PREFIX treatment: <http://example.org/treatments/>

SELECT ?plant ?disease ?treatment WHERE {
    ?plant plant:hasDisease ?disease .
    ?disease rdf:type ?diseaseType .
    ?diseaseType treatment:recommendedTreatment ?treatment .
}
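
To inspect what reasoning actually contributes, GraphDB exposes the pseudo-graphs <http://www.ontotext.com/explicit> and <http://www.ontotext.com/implicit>, which restrict a query to asserted or inferred statements respectively. A quick Python check of the inferred triple count:

import requests

# Count only inferred triples using GraphDB's implicit pseudo-graph
query = """
SELECT (COUNT(*) AS ?inferred)
FROM <http://www.ontotext.com/implicit>
WHERE { ?s ?p ?o }
"""

resp = requests.get(
    "http://localhost:7200/repositories/plant-ontology",
    params={"query": query},
    headers={"Accept": "application/sparql-results+json"},
)
print(resp.json()["results"]["bindings"][0]["inferred"]["value"], "inferred triples")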

Python Integration

Setup SPARQLWrapper

from SPARQLWrapper import SPARQLWrapper, JSON, POST, GET
import json

class GraphDBConnector:
    def __init__(self, endpoint="http://localhost:7200/repositories/plant-ontology"):
        self.endpoint = endpoint
        self.sparql = SPARQLWrapper(endpoint)
        # SPARQL updates must target the separate /statements endpoint
        self.update_endpoint = endpoint + "/statements"
        self.sparql_update = SPARQLWrapper(self.update_endpoint)
    
    def query(self, sparql_query):
        """Execute SPARQL SELECT query"""
        self.sparql.setQuery(sparql_query)
        self.sparql.setReturnFormat(JSON)
        self.sparql.setMethod(GET)
        
        try:
            results = self.sparql.query().convert()
            return results["results"]["bindings"]
        except Exception as e:
            print(f"Query error: {e}")
            return []
    
    def update(self, sparql_update):
        """Execute SPARQL UPDATE against the /statements endpoint"""
        self.sparql_update.setQuery(sparql_update)
        self.sparql_update.setMethod(POST)
        
        try:
            self.sparql_update.query()
            return True
        except Exception as e:
            print(f"Update error: {e}")
            return False
    
    def get_all_classes(self):
        """Retrieve all ontology classes"""
        query = """
        PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        
        SELECT ?class ?label WHERE {
            ?class rdf:type rdfs:Class .
            OPTIONAL { ?class rdfs:label ?label }
        }
        ORDER BY ?class
        """
        return self.query(query)
    
    def find_plant_diseases(self, plant_type=None):
        """Find diseases affecting plants"""
        filter_clause = ""
        if plant_type:
            filter_clause = f"FILTER (?plantType = <{plant_type}>)"
        
        query = f"""
        PREFIX plant: <http://example.org/plants/>
        PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
        
        SELECT ?plant ?plantType ?disease WHERE {{
            ?plant plant:hasDisease ?disease .
            ?plant rdf:type ?plantType .
            {filter_clause}
        }}
        """
        return self.query(query)

# Usage example
db = GraphDBConnector()

# Get all classes
classes = db.get_all_classes()
for cls in classes:
    print(f"Class: {cls.get('class', {}).get('value', '')}")
    print(f"Label: {cls.get('label', {}).get('value', 'No label')}")

# Find diseases
diseases = db.find_plant_diseases()
for disease in diseases:
    print(f"Plant: {disease['plant']['value']}")
    print(f"Disease: {disease['disease']['value']}")

Integration with Protégé

1. Connect Protégé to GraphDB

Protégé Desktop has no built-in GraphDB connector (and GraphDB ships no JDBC driver for Protégé), so the connection goes over HTTP:

  1. Open from URL: File → Open from URL, pointing at the repository's statements endpoint, which serves the current repository contents as RDF:

    http://localhost:7200/repositories/plant-ontology/statements

  2. SPARQL endpoint: Plugins and tools that speak SPARQL can use the repository endpoint directly:

    SPARQL Endpoint: http://localhost:7200/repositories/plant-ontology

2. Publish from Protégé to GraphDB

Method 1: Manual Export/Import

# In Protégé: File → Export → RDF/XML
# Then upload to GraphDB via web interface

Method 2: Direct Connection

# Export from Protégé and upload programmatically
import io
import requests
from owlready2 import get_ontology

# Load the ontology saved from Protégé
onto = get_ontology("file://path/to/protege_ontology.owl").load()

# Serialize to RDF/XML in memory with owlready2's save()
buffer = io.BytesIO()
onto.save(file=buffer, format="rdfxml")
rdf_data = buffer.getvalue()

# Upload to GraphDB
response = requests.post(
    "http://localhost:7200/repositories/plant-ontology/statements",
    headers={'Content-Type': 'application/rdf+xml'},
    data=rdf_data
)
print(response.status_code)

Pydantic Integration for Structured Data

1. Ontology-Driven Model Generation

from pydantic import BaseModel, Field
from typing import List, Optional
from enum import Enum

class DiseaseType(str, Enum):
    FUNGAL = "http://example.org/diseases/FungalDisease"
    VIRAL = "http://example.org/diseases/ViralDisease"
    BACTERIAL = "http://example.org/diseases/BacterialDisease"

class Symptom(BaseModel):
    name: str = Field(..., description="Symptom name")
    severity: int = Field(..., ge=1, le=10, description="Severity scale 1-10")
    location: str = Field(..., description="Where symptom appears")
    
    class Config:
        schema_extra = {
            "example": {
                "name": "leaf_yellowing",
                "severity": 7,
                "location": "leaves"
            }
        }

class Disease(BaseModel):
    disease_id: str = Field(..., description="Disease identifier")
    name: str = Field(..., description="Disease name")
    type: DiseaseType = Field(..., description="Disease classification")
    symptoms: List[Symptom] = Field(..., description="Associated symptoms")
    treatment: Optional[str] = Field(None, description="Recommended treatment")
    
    @classmethod
    def from_graphdb(cls, disease_uri: str, db_connector: GraphDBConnector):
        """Create Disease model from GraphDB data"""
        query = f"""
        PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
        PREFIX disease: <http://example.org/diseases/>
        PREFIX symptom: <http://example.org/symptoms/>
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        
        SELECT ?name ?type ?symptom ?symptomName WHERE {{
            <{disease_uri}> rdfs:label ?name .
            <{disease_uri}> rdf:type ?type .
            <{disease_uri}> disease:hasSymptom ?symptom .
            ?symptom rdfs:label ?symptomName .
        }}
        """
        
        results = db_connector.query(query)
        
        # Process results and create Pydantic model
        if results:
            symptoms = [
                Symptom(
                    name=result['symptomName']['value'],
                    severity=5,  # Default, could be queried
                    location="plant"  # Default, could be queried
                )
                for result in results
            ]
            
            return cls(
                disease_id=disease_uri,
                name=results[0]['name']['value'],
                type=results[0]['type']['value'],
                symptoms=symptoms
            )
        return None

class Plant(BaseModel):
    plant_id: str = Field(..., description="Plant identifier")
    scientific_name: str = Field(..., description="Scientific name")
    common_name: str = Field(..., description="Common name")
    diseases: List[Disease] = Field(default_factory=list, description="Associated diseases")
    
    def add_disease_from_graphdb(self, disease_uri: str, db_connector: GraphDBConnector):
        """Add disease information from GraphDB"""
        disease = Disease.from_graphdb(disease_uri, db_connector)
        if disease:
            self.diseases.append(disease)
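
A short usage sketch tying the models to the connector (the disease URI is hypothetical and must exist in your repository):

# Hypothetical example: populate a Plant model from GraphDB
db = GraphDBConnector()

tomato = Plant(
    plant_id="http://example.org/plants/Tomato",
    scientific_name="Solanum lycopersicum",
    common_name="Tomato",
)
tomato.add_disease_from_graphdb("http://example.org/diseases/EarlyBlight", db)

# Pydantic gives us validated, JSON-serializable structured data
print(tomato.json(indent=2))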

2. LLM Integration with Structured Data

from openai import OpenAI
import json

class OntologyLLMIntegration:
    def __init__(self, db_connector: GraphDBConnector, openai_client: OpenAI):
        self.db = db_connector
        self.llm = openai_client
    
    def diagnose_plant_disease(self, plant_description: str) -> Plant:
        """Use LLM to diagnose plant disease with ontology constraints"""
        
        # Get available diseases from ontology
        available_diseases = self.db.query("""
            PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
            PREFIX disease: <http://example.org/diseases/>
            PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
            
            SELECT ?disease ?name WHERE {
                ?disease rdf:type disease:Disease .
                ?disease rdfs:label ?name .
            }
        """)
        
        # Create prompt with ontology constraints
        disease_options = [d['name']['value'] for d in available_diseases]
        
        prompt = f"""
        Based on this plant description: "{plant_description}"
        
        Available diseases in our ontology: {disease_options}
        
        Please identify the most likely disease and return a JSON object matching this schema:
        {{
            "plant_id": "generated_id",
            "scientific_name": "species name if identifiable",
            "common_name": "common name if identifiable", 
            "diseases": [{{
                "disease_id": "http://example.org/diseases/DiseaseName",
                "name": "disease_name",
                "type": "http://example.org/diseases/DiseaseType",
                "symptoms": [{{
                    "name": "symptom_name",
                    "severity": severity_1_to_10,
                    "location": "affected_location"
                }}],
                "treatment": "recommended_treatment"
            }}]
        }}
        """
        
        response = self.llm.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            response_format={"type": "json_object"}
        )
        
        # Parse and validate with Pydantic
        try:
            plant_data = json.loads(response.choices[0].message.content)
            plant = Plant(**plant_data)
            return plant
        except Exception as e:
            print(f"Error parsing LLM response: {e}")
            return None

# Usage
db = GraphDBConnector()
llm_client = OpenAI()
integrator = OntologyLLMIntegration(db, llm_client)

# Diagnose plant
plant = integrator.diagnose_plant_disease(
    "My tomato plant has yellow spots on leaves and wilting stems"
)

if plant:
    print(f"Diagnosed plant: {plant.common_name}")
    for disease in plant.diseases:
        print(f"Disease: {disease.name}")
        print(f"Treatment: {disease.treatment}")

Performance Optimization

1. Indexing

# GraphDB maintains PSO and POS indices automatically; for workloads
# dominated by per-predicate scans like the one below, enable the
# predicate list index in the repository settings
PREFIX plant: <http://example.org/plants/>

SELECT ?plant ?disease WHERE {
    ?plant plant:hasDisease ?disease .
}

2. Query Optimization

# Optimized query structure
SELECT ?plant ?disease ?symptom WHERE {
    # Most selective triple first
    ?disease rdf:type :FungalDisease .
    ?plant :hasDisease ?disease .
    ?disease :hasSymptom ?symptom .
}
# Instead of starting with ?plant rdf:type :Plant (less selective)

3. Repository Tuning

# GraphDB repository parameters (set in the repository config TTL)
graphdb:entity-index-size "10000000"
graphdb:query-timeout "60"
graphdb:enable-context-index "true"

Monitoring & Maintenance

1. Health Checks

import requests

def check_graphdb_health():
    """Monitor GraphDB status"""
    try:
        response = requests.get("http://localhost:7200/rest/monitor/infrastructure", timeout=5)
        if response.status_code == 200:
            print("✅ GraphDB is healthy")
            return True
    except requests.RequestException:
        pass
    print("❌ GraphDB is not responding")
    return False

def check_repository_status(repo_name):
    """Check a specific repository (the /size endpoint returns a plain integer)"""
    response = requests.get(f"http://localhost:7200/repositories/{repo_name}/size")
    if response.status_code == 200:
        print(f"Repository {repo_name}: {response.text} triples")
        return True
    return False
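
For example, gate a scheduled import on both checks passing:

# Run both checks before a nightly import job
if check_graphdb_health() and check_repository_status("plant-ontology"):
    upload_ontology("ontologies/plant_disease.owl", "http://plants.org")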

2. Backup & Recovery

# Backup repository
curl -X POST "http://localhost:7200/rest/recovery/backup" \
     -H "Content-Type: application/json" \
     -d '{"repository": "plant-ontology", "backupName": "daily-backup"}'

# List backups
curl "http://localhost:7200/rest/recovery/backup"

# Restore from backup
curl -X POST "http://localhost:7200/rest/recovery/restore" \
     -H "Content-Type: application/json" \
     -d '{"repository": "plant-ontology", "backupName": "daily-backup"}'

Next Steps

  1. Setup GraphDB: Follow the Docker installation guide
  2. Import Ontologies: Upload your first ontology files
  3. Practice SPARQL: Start with basic queries
  4. Python Integration: Build your first ontology-driven application
  5. LLM Integration: Explore structured data extraction

Troubleshooting

Common Issues

Connection refused:

  • Check if Docker container is running
  • Verify port 7200 is not blocked

Out of memory:

  • Increase Docker memory limits
  • Tune GDB_JAVA_OPTS (e.g. raise -Xmx) in docker-compose-graphdb.yml

Import failures:

  • Check ontology file format
  • Validate RDF/XML syntax
  • Review error logs in GraphDB workbench
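
One way to catch malformed files before import is to parse them locally with rdflib (a quick sanity check; rdflib is not otherwise required by this guide):

from rdflib import Graph

def validate_rdf(file_path, fmt="xml"):
    """Try to parse an RDF/XML file before uploading it to GraphDB."""
    try:
        g = Graph()
        g.parse(file_path, format=fmt)
        print(f"✅ {file_path}: {len(g)} triples parsed")
        return True
    except Exception as e:
        print(f"❌ {file_path}: {e}")
        return False

validate_rdf("ontologies/pizza.owl")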

GraphDB provides the robust foundation needed for scalable ontology applications. Combined with Pydantic models and LLM integration, it enables powerful semantic AI systems.