flowchart LR
A[Ontology Knowledge] --> B[Pydantic Schema]
B --> C[LLM Prompt]
C --> D[Structured Output]
D --> E[Validation]
E --> F[Semantic Reasoning]
classDef knowledge fill:#e8f5e8,stroke:#4caf50,stroke-width:2px,color:#2e7d32
classDef processing fill:#e3f2fd,stroke:#2196f3,stroke-width:2px,color:#1565c0
classDef output fill:#fff3e0,stroke:#ff9800,stroke-width:2px,color:#ef6c00
classDef validation fill:#fce4ec,stroke:#e91e63,stroke-width:2px,color:#c2185b
class A knowledge
class B,C processing
class D,F output
class E validation
Pydantic Models & Structured Data
Bridging ontologies and structured data extraction with LLMs
Learn how to use ontologies from GraphDB with Pydantic models to extract structured data using LLMs and semantic reasoning. This approach combines the best of formal knowledge representation with modern AI capabilities.
Why Pydantic + Ontologies?
The Challenge
By default, LLMs produce unstructured text, which makes it difficult to:
- Validate data consistency
- Integrate with existing systems
- Reason about relationships
- Ensure domain correctness
The Solution
The integration follows a six-stage pipeline:
Pipeline Overview: Ontology Knowledge → Pydantic Schema → LLM Prompt → Structured Output → Validation → Semantic Reasoning
This architecture validates LLM outputs against formal ontological constraints, which provides type safety and semantic consistency while keeping the flexibility of natural-language input.
Benefits:
- Type Safety: Automatic validation of LLM outputs
- Consistency: Ontology ensures domain correctness
- Integration: Seamless API and database integration
- Reasoning: Enable logical inference on extracted data
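The validation stage of the pipeline can be illustrated with a minimal sketch that uses only the standard library, so it runs without GraphDB or an LLM. The `KNOWN_SYMPTOM_URIS` set, the `SymptomRecord` class, and the `validate_llm_output` function are hypothetical names introduced for illustration; in the real pipeline a Pydantic model and a GraphDB query play these roles.

```python
import json
from dataclasses import dataclass

# Hypothetical stand-in for URIs that would normally be fetched from GraphDB.
KNOWN_SYMPTOM_URIS = {
    "http://example.org/symptoms/LeafYellowing",
    "http://example.org/symptoms/LeafSpot",
}


@dataclass
class SymptomRecord:
    symptom_uri: str
    confidence: float


def validate_llm_output(raw_json: str) -> SymptomRecord:
    """Parse LLM output, then apply a range check and an ontology membership check."""
    data = json.loads(raw_json)  # structured output from the LLM
    record = SymptomRecord(
        symptom_uri=str(data["symptom_uri"]),
        confidence=float(data["confidence"]),
    )
    # Type-level validation: confidence must be a probability.
    if not 0.0 <= record.confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    # Semantic validation: the URI must be known to the ontology.
    if record.symptom_uri not in KNOWN_SYMPTOM_URIS:
        raise ValueError(f"unknown symptom URI: {record.symptom_uri}")
    return record
```

A URI that is well-formed but absent from the ontology is rejected in the same way as a malformed value, which is the property the pipeline relies on.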
Core Concepts
1. Ontology-Driven Schema Design
Instead of manually creating Pydantic models, derive them from ontologies:
from typing import List

from pydantic import BaseModel, Field, HttpUrl, validator

# Traditional approach (manual)
class Plant(BaseModel):
    name: str
    diseases: List[str]  # Unstructured!

# Ontology-driven approach
class Plant(BaseModel):
    plant_uri: HttpUrl = Field(..., description="Plant ontology URI")
    # Pydantic v1 syntax; Pydantic v2 renames `regex` to `pattern`
    scientific_name: str = Field(..., regex=r"^[A-Z][a-z]+ [a-z]+$")
    diseases: List['Disease'] = Field(..., description="Diseases from ontology")

    @validator('plant_uri')
    def validate_plant_exists_in_ontology(cls, v):
        # Check if URI exists in GraphDB
        # (validate_ontology_uri is a helper you supply; it is not defined in this guide)
        return validate_ontology_uri(v, "Plant")
2. Semantic Validation
from pydantic import BaseModel, validator, Field
from typing import List, Optional, Union
from enum import Enum
import requests

class OntologyValidator:
    """Validates data against GraphDB ontology"""

    def __init__(self, graphdb_endpoint="http://localhost:7200/repositories/plant-ontology"):
        self.endpoint = graphdb_endpoint

    def validate_class_membership(self, uri: str, class_name: str) -> bool:
        """Check if URI is instance of ontology class"""
        # The instance's type must be the target class or one of its subclasses
        query = f"""
            PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
            PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
            ASK {{
                <{uri}> rdf:type ?class .
                ?class rdfs:subClassOf* <http://example.org/{class_name}> .
            }}
        """
        response = requests.post(
            self.endpoint,
            headers={
                'Content-Type': 'application/sparql-query',
                # Request JSON results so response.json() can parse them
                'Accept': 'application/sparql-results+json',
            },
            data=query
        )
        return response.json().get('boolean', False)

    def get_valid_values(self, property_name: str) -> List[str]:
        """Get all valid values for a property from ontology"""
        query = f"""
            PREFIX plant: <http://example.org/plants/>
            PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
            SELECT DISTINCT ?value WHERE {{
                ?subject plant:{property_name} ?value .
            }}
        """
        response = requests.post(
            self.endpoint,
            headers={
                'Content-Type': 'application/sparql-query',
                'Accept': 'application/sparql-results+json',
            },
            data=query
        )
        bindings = response.json().get('results', {}).get('bindings', [])
        # Each binding maps the variable name ("value") to a {type, value} object
        return [b['value']['value'] for b in bindings]

# Global validator instance
ontology_validator = OntologyValidator()
Building Ontology-Driven Models
1. Disease Classification Model
from pydantic import BaseModel, Field, validator, root_validator
from typing import List, Optional, Dict, Any
from enum import Enum
from datetime import datetime

class DiseaseType(str, Enum):
    """Disease types from ontology"""
    FUNGAL = "http://example.org/diseases/FungalDisease"
    VIRAL = "http://example.org/diseases/ViralDisease"
    BACTERIAL = "http://example.org/diseases/BacterialDisease"
    NUTRITIONAL = "http://example.org/diseases/NutritionalDisease"
    ENVIRONMENTAL = "http://example.org/diseases/EnvironmentalDisease"

class SeverityLevel(int, Enum):
    """Severity scale from ontology"""
    MINIMAL = 1
    LIGHT = 2
    MODERATE = 3
    SEVERE = 4
    CRITICAL = 5

class Symptom(BaseModel):
    """Plant disease symptom with ontology validation"""
    symptom_uri: str = Field(
        ...,
        description="Symptom URI from ontology",
        example="http://example.org/symptoms/LeafYellowing"
    )
    name: str = Field(..., description="Human-readable symptom name")
    severity: SeverityLevel = Field(..., description="Severity level 1-5")
    location: str = Field(..., description="Where symptom appears on plant")
    confidence: float = Field(..., ge=0.0, le=1.0, description="Detection confidence")

    @validator('symptom_uri')
    def validate_symptom_uri(cls, v):
        """Ensure symptom exists in ontology"""
        if not ontology_validator.validate_class_membership(v, "Symptom"):
            raise ValueError(f"Symptom URI {v} not found in ontology")
        return v

    @validator('location')
    def validate_location(cls, v):
        """Ensure location is valid plant part"""
        valid_locations = ontology_validator.get_valid_values("hasLocation")
        if v not in valid_locations:
            raise ValueError(f"Location '{v}' not valid. Must be one of: {valid_locations}")
        return v

class Treatment(BaseModel):
    """Treatment recommendation from ontology"""
    treatment_uri: str = Field(..., description="Treatment URI from ontology")
    name: str = Field(..., description="Treatment name")
    type: str = Field(..., description="Treatment type (chemical, biological, cultural)")
    application_method: str = Field(..., description="How to apply treatment")
    effectiveness: float = Field(..., ge=0.0, le=1.0, description="Expected effectiveness")

    @validator('treatment_uri')
    def validate_treatment_uri(cls, v):
        if not ontology_validator.validate_class_membership(v, "Treatment"):
            raise ValueError(f"Treatment URI {v} not found in ontology")
        return v

class Disease(BaseModel):
    """Plant disease with full ontology integration"""
    disease_uri: str = Field(..., description="Disease URI from ontology")
    name: str = Field(..., description="Disease name")
    scientific_name: Optional[str] = Field(None, description="Scientific name if pathogen")
    type: DiseaseType = Field(..., description="Disease classification")
    symptoms: List[Symptom] = Field(..., min_items=1, description="Observed symptoms")
    treatments: List[Treatment] = Field(default=[], description="Recommended treatments")
    confidence: float = Field(..., ge=0.0, le=1.0, description="Diagnosis confidence")

    @validator('disease_uri')
    def validate_disease_uri(cls, v):
        if not ontology_validator.validate_class_membership(v, "Disease"):
            raise ValueError(f"Disease URI {v} not found in ontology")
        return v

    @root_validator
    def validate_disease_symptom_consistency(cls, values):
        """Ensure symptoms are consistent with disease type"""
        disease_uri = values.get('disease_uri')
        symptoms = values.get('symptoms', [])
        if disease_uri and symptoms:
            # Query ontology for valid symptoms for this disease
            valid_symptoms = get_valid_symptoms_for_disease(disease_uri)
            for symptom in symptoms:
                if symptom.symptom_uri not in valid_symptoms:
                    raise ValueError(
                        f"Symptom {symptom.name} not associated with disease {disease_uri} in ontology"
                    )
        return values

def get_valid_symptoms_for_disease(disease_uri: str) -> List[str]:
    """Get symptoms associated with disease from ontology"""
    query = f"""
        PREFIX disease: <http://example.org/diseases/>
        PREFIX symptom: <http://example.org/symptoms/>
        SELECT ?symptom WHERE {{
            <{disease_uri}> disease:hasSymptom ?symptom .
        }}
    """
    # Mirrors OntologyValidator.get_valid_values; adapt to your GraphDB connection.
    # An empty placeholder result would make the root validator reject every symptom.
    response = requests.post(
        ontology_validator.endpoint,
        headers={
            'Content-Type': 'application/sparql-query',
            'Accept': 'application/sparql-results+json',
        },
        data=query
    )
    bindings = response.json().get('results', {}).get('bindings', [])
    return [b['symptom']['value'] for b in bindings]

class Plant(BaseModel):
    """Plant with comprehensive ontology integration"""
    plant_uri: str = Field(..., description="Plant URI from ontology")
    scientific_name: str = Field(..., regex=r"^[A-Z][a-z]+ [a-z]+$")
    common_names: List[str] = Field(default=[], description="Common names")
    plant_family: str = Field(..., description="Taxonomic family")
    diseases: List[Disease] = Field(default=[], description="Diagnosed diseases")
    health_status: str = Field(default="unknown", description="Overall health assessment")
    diagnosis_date: datetime = Field(default_factory=datetime.now)

    @validator('plant_uri')
    def validate_plant_uri(cls, v):
        if not ontology_validator.validate_class_membership(v, "Plant"):
            raise ValueError(f"Plant URI {v} not found in ontology")
        return v

    @validator('plant_family')
    def validate_plant_family(cls, v):
        valid_families = ontology_validator.get_valid_values("belongsToFamily")
        if v not in valid_families:
            raise ValueError(f"Plant family '{v}' not found in ontology")
        return v

    def add_disease_from_ontology(self, disease_uri: str, symptoms: List[Dict]) -> None:
        """Add disease based on ontology data"""
        # Query ontology for disease details
        disease_data = query_disease_details(disease_uri)
        # Create symptom objects
        symptom_objects = [
            Symptom(
                symptom_uri=s['uri'],
                name=s['name'],
                severity=s.get('severity', 3),
                location=s.get('location', 'unknown'),
                confidence=s.get('confidence', 0.8)
            )
            for s in symptoms
        ]
        # Create disease object
        disease = Disease(
            disease_uri=disease_uri,
            name=disease_data['name'],
            scientific_name=disease_data.get('scientific_name'),
            type=disease_data['type'],
            symptoms=symptom_objects,
            confidence=calculate_diagnosis_confidence(symptom_objects)
        )
        self.diseases.append(disease)

def query_disease_details(disease_uri: str) -> Dict[str, Any]:
    """Query ontology for disease details"""
    # Implementation would query GraphDB
    return {
        'name': 'Example Disease',
        'type': DiseaseType.FUNGAL,
        'scientific_name': 'Fungus example'
    }

def calculate_diagnosis_confidence(symptoms: List[Symptom]) -> float:
    """Calculate overall diagnosis confidence from symptoms"""
    if not symptoms:
        return 0.0
    return sum(s.confidence for s in symptoms) / len(symptoms)
2. LLM Integration with Structured Extraction
from openai import OpenAI
import json
from typing import Type, TypeVar

T = TypeVar('T', bound=BaseModel)

class OntologyLLMExtractor:
    """Extract structured data from text using LLM + Ontology validation"""

    def __init__(self,
                 llm_client: OpenAI,
                 graphdb_endpoint: str = "http://localhost:7200/repositories/plant-ontology"):
        self.llm = llm_client
        self.ontology_validator = OntologyValidator(graphdb_endpoint)

    def extract_structured_data(self,
                                text: str,
                                target_model: Type[T],
                                context: Optional[str] = None) -> Optional[T]:
        """Extract and validate structured data from text"""
        # Get ontology constraints for the model
        ontology_context = self._get_ontology_context(target_model)
        # Build prompt with ontology constraints
        prompt = self._build_extraction_prompt(text, target_model, ontology_context, context)
        # Get LLM response
        response = self._call_llm(prompt)
        # Parse and validate with Pydantic
        try:
            structured_data = target_model.parse_raw(response)
            return structured_data
        except Exception as e:
            print(f"Validation error: {e}")
            return None

    def _get_ontology_context(self, model_class: Type[BaseModel]) -> Dict[str, Any]:
        """Extract ontology constraints from Pydantic model"""
        context = {}
        # Get field information
        for field_name, field_info in model_class.__fields__.items():
            if hasattr(field_info.type_, '__members__'):  # Enum
                context[field_name] = {
                    'type': 'enum',
                    # List the enum values (not the member names), because
                    # validation accepts only the values, e.g. the disease URIs
                    'values': [member.value for member in field_info.type_]
                }
            elif field_name.endswith('_uri'):
                context[field_name] = {
                    'type': 'uri',
                    'ontology_class': field_name.replace('_uri', '').title()
                }
        return context

    def _build_extraction_prompt(self,
                                 text: str,
                                 target_model: Type[BaseModel],
                                 ontology_context: Dict,
                                 context: Optional[str]) -> str:
        """Build extraction prompt with ontology constraints"""
        # Get JSON schema
        schema = target_model.schema()
        # Build constraint descriptions
        constraints = []
        for field, info in ontology_context.items():
            if info['type'] == 'enum':
                constraints.append(f"- {field}: Must be one of {info['values']}")
            elif info['type'] == 'uri':
                constraints.append(f"- {field}: Must be valid ontology URI for {info['ontology_class']}")
        constraint_text = "\n".join(constraints) if constraints else "No specific constraints"
        context_text = f"\nAdditional context: {context}" if context else ""
        prompt = f"""
Extract structured information from the following text and format it according to the JSON schema provided.

Text to analyze:
"{text}"
{context_text}

JSON Schema:
{json.dumps(schema, indent=2)}

Ontology Constraints:
{constraint_text}

Important:
- Use actual URIs from the ontology (format: http://example.org/category/SpecificItem)
- Ensure all enum values match exactly
- Include confidence scores based on text evidence
- If information is not available, use null or appropriate defaults

Return only valid JSON that matches the schema:
"""
        return prompt

    def _call_llm(self, prompt: str) -> str:
        """Call LLM with structured output"""
        response = self.llm.chat.completions.create(
            # JSON mode needs a model that supports response_format;
            # the original gpt-4 model rejects it, so a newer model is used here
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "You are an expert at extracting structured data from text using formal ontologies. Always return valid JSON."},
                {"role": "user", "content": prompt}
            ],
            response_format={"type": "json_object"},
            temperature=0.1  # Lower temperature for more consistent extraction
        )
        return response.choices[0].message.content

# Example Usage
def diagnose_plant_from_text(description: str) -> Optional[Plant]:
    """Diagnose plant disease from natural language description"""
    llm_client = OpenAI()
    extractor = OntologyLLMExtractor(llm_client)
    # Extract structured plant data
    plant = extractor.extract_structured_data(
        text=description,
        target_model=Plant,
        context="Plant disease diagnosis context. Focus on identifying symptoms, diseases, and treatments."
    )
    if plant:
        print(f"Extracted plant data: {plant.scientific_name}")
        for disease in plant.diseases:
            print(f"  Disease: {disease.name} (confidence: {disease.confidence:.2f})")
            for symptom in disease.symptoms:
                print(f"    Symptom: {symptom.name} - {symptom.location}")
    else:
        print("Failed to extract valid plant data")
    return plant

# Test the extraction
description = """
I have a tomato plant (Solanum lycopersicum) with yellow spots on the leaves
that are spreading quickly. The spots have dark centers and the leaves are
starting to wilt. Some of the stems also show dark streaks. This started
about a week ago after heavy rains.
"""
plant = diagnose_plant_from_text(description)
3. MOE Integration with Semantic Routing
from typing import Callable, Dict, List, Optional

class SemanticExpertRouter:
    """Route queries to appropriate experts based on ontology concepts"""

    def __init__(self, graphdb_endpoint: str):
        self.ontology_validator = OntologyValidator(graphdb_endpoint)
        self.experts: Dict[str, Callable] = {}

    def register_expert(self, domain_uri: str, expert_function: Callable):
        """Register an expert for a specific ontology domain"""
        self.experts[domain_uri] = expert_function

    def route_query(self, query: str, extracted_data: BaseModel) -> str:
        """Route query to most appropriate expert based on ontology concepts"""
        # Extract ontology concepts from the data
        concepts = self._extract_concepts(extracted_data)
        # Find best matching expert
        best_expert = self._find_best_expert(concepts)
        if best_expert:
            return best_expert(query, extracted_data)
        else:
            return self._general_expert(query, extracted_data)

    def _extract_concepts(self, data: BaseModel) -> List[str]:
        """Extract ontology URIs from Pydantic model"""
        concepts = []
        # Get all URI fields, recursing into lists of nested models
        for field_name, field_value in data.__dict__.items():
            if field_name.endswith('_uri') and isinstance(field_value, str):
                concepts.append(field_value)
            elif isinstance(field_value, list):
                for item in field_value:
                    if hasattr(item, '__dict__'):
                        concepts.extend(self._extract_concepts(item))
        return concepts

    def _find_best_expert(self, concepts: List[str]) -> Optional[Callable]:
        """Find expert with highest concept overlap"""
        best_score = 0
        best_expert = None
        for domain_uri, expert in self.experts.items():
            score = self._calculate_similarity(domain_uri, concepts)
            if score > best_score:
                best_score = score
                best_expert = expert
        return best_expert if best_score > 0.3 else None  # Threshold

    def _calculate_similarity(self, domain_uri: str, concepts: List[str]) -> float:
        """Calculate semantic similarity between domain and concepts"""
        # Query ontology for related concepts
        related_concepts = self._get_related_concepts(domain_uri)
        # Jaccard similarity: size of the intersection over size of the union
        overlap = len(set(concepts) & set(related_concepts))
        total = len(set(concepts) | set(related_concepts))
        return overlap / total if total > 0 else 0.0

    def _get_related_concepts(self, domain_uri: str) -> List[str]:
        """Get concepts related to domain from ontology"""
        query = f"""
            PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
            SELECT ?concept WHERE {{
                {{
                    <{domain_uri}> rdfs:subClassOf* ?concept .
                }} UNION {{
                    ?concept rdfs:subClassOf* <{domain_uri}> .
                }} UNION {{
                    <{domain_uri}> ?property ?concept .
                }}
            }}
        """
        # Execute query and return related concepts
        # Implementation depends on GraphDB connection
        # Until this is implemented, every query falls through to the general expert
        return []

    def _general_expert(self, query: str, data: BaseModel) -> str:
        """Fallback expert for unmatched queries"""
        return f"General analysis of {type(data).__name__}: {data.json()}"

# Expert functions for different domains
def fungal_disease_expert(query: str, plant: Plant) -> str:
    """Expert specialized in fungal diseases"""
    fungal_diseases = [d for d in plant.diseases if d.type == DiseaseType.FUNGAL]
    if fungal_diseases:
        disease = fungal_diseases[0]  # Focus on primary disease
        analysis = f"""
FUNGAL DISEASE ANALYSIS for {plant.scientific_name}
Primary Disease: {disease.name}
Confidence: {disease.confidence:.1%}
Key Symptoms:
{chr(10).join(f"- {s.name} ({s.location}, severity {s.severity})" for s in disease.symptoms)}
Recommended Actions:
1. Apply broad-spectrum fungicide
2. Improve air circulation
3. Reduce leaf wetness
4. Remove infected plant material
Prognosis: {"Good" if disease.confidence < 0.7 else "Requires immediate attention"}
"""
        return analysis
    return "No fungal diseases detected in the provided data."

def viral_disease_expert(query: str, plant: Plant) -> str:
    """Expert specialized in viral diseases"""
    viral_diseases = [d for d in plant.diseases if d.type == DiseaseType.VIRAL]
    if viral_diseases:
        return f"Viral disease detected: {viral_diseases[0].name}. No chemical treatment available. Focus on vector control and plant removal."
    return "No viral diseases detected."

# Setup MOE system
def setup_moe_system():
    """Initialize MOE system with ontology-based routing"""
    router = SemanticExpertRouter("http://localhost:7200/repositories/plant-ontology")
    # Register domain experts
    router.register_expert("http://example.org/diseases/FungalDisease", fungal_disease_expert)
    router.register_expert("http://example.org/diseases/ViralDisease", viral_disease_expert)
    return router

# Example usage
def analyze_plant_with_moe(description: str) -> str:
    """Complete analysis pipeline with MOE routing"""
    # Step 1: Extract structured data
    plant = diagnose_plant_from_text(description)
    if not plant:
        return "Unable to extract plant information from description."
    # Step 2: Route to appropriate expert
    router = setup_moe_system()
    analysis = router.route_query(description, plant)
    return analysis

# Test complete pipeline
test_description = """
My tomato plants have developed circular brown spots with yellow halos on the leaves.
The spots started small but are growing larger and some leaves are turning completely yellow.
I also notice dark lesions on the stems near the soil line. The problem started after
several days of high humidity and warm temperatures.
"""
result = analyze_plant_with_moe(test_description)
print(result)
Advanced Patterns
1. Hierarchical Validation
class HierarchicalValidator(BaseModel):
    """Validate data at multiple ontology levels"""

    @validator('*', pre=True)
    def validate_hierarchy(cls, v, field):
        """Validate against ontology hierarchy"""
        if field.name.endswith('_uri'):
            # Check if URI exists at correct hierarchy level
            expected_class = field.name.replace('_uri', '').title()
            if not validate_ontology_hierarchy(v, expected_class):
                raise ValueError(f"URI {v} not in correct hierarchy for {expected_class}")
        return v

def validate_ontology_hierarchy(uri: str, expected_class: str) -> bool:
    """Check if URI is in correct ontology hierarchy"""
    query = f"""
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        ASK {{
            <{uri}> rdfs:subClassOf* <http://example.org/{expected_class}> .
        }}
    """
    # Execute query and return boolean result
    return True  # Placeholder
2. Dynamic Schema Generation
from pydantic import create_model

def generate_pydantic_model_from_ontology(class_uri: str) -> Type[BaseModel]:
    """Generate Pydantic model from ontology class definition"""
    # Query ontology for class properties
    properties = query_class_properties(class_uri)
    # Build field definitions
    fields = {}
    validators = {}
    for prop in properties:
        field_name = prop['name']
        field_type = map_ontology_type_to_python(prop['range'])
        field_info = Field(..., description=prop.get('comment', ''))
        fields[field_name] = (field_type, field_info)
        # Add ontology validator
        # (create_ontology_validator is a helper you supply; it is not defined in
        # this guide and must return a function wrapped with pydantic's validator())
        if prop.get('validation_query'):
            validators[f"validate_{field_name}"] = create_ontology_validator(prop['validation_query'])
    # Create dynamic model class
    model_class = create_model(
        f"Generated{class_uri.split('/')[-1]}",
        **fields,
        __validators__=validators
    )
    return model_class

def query_class_properties(class_uri: str) -> List[Dict]:
    """Query ontology for class properties"""
    # Implementation would query GraphDB
    return []

def map_ontology_type_to_python(ontology_type: str) -> type:
    """Map ontology data types to Python types"""
    mapping = {
        'xsd:string': str,
        'xsd:int': int,
        'xsd:float': float,
        'xsd:boolean': bool,
        'xsd:dateTime': datetime
    }
    # Unknown ontology types fall back to str
    return mapping.get(ontology_type, str)
Best Practices
1. Schema Design
- Start with ontology: Design ontology first, then generate Pydantic models
- Use URIs: Always reference ontology concepts by URI
- Validate hierarchies: Ensure data respects ontological relationships
- Include confidence: Track certainty of extracted information
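Because the queries in this guide interpolate URIs directly into SPARQL text, it can help to check each URI before it reaches a query. The sketch below is one possible guard, not part of the guide's code: `safe_iri` and `build_membership_query` are hypothetical helper names, and the rejected character set follows the SPARQL grammar for IRI references.

```python
from urllib.parse import urlparse

# Characters that SPARQL does not allow inside an IRI reference (<...>).
_FORBIDDEN_IRI_CHARS = set('<>"{}|^`\\')


def safe_iri(uri: str) -> str:
    """Return the URI wrapped in angle brackets, or raise ValueError if unsafe."""
    parsed = urlparse(uri)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"not an absolute http(s) URI: {uri!r}")
    # Reject forbidden punctuation plus spaces and control characters.
    if any(ch in _FORBIDDEN_IRI_CHARS or ord(ch) <= 0x20 for ch in uri):
        raise ValueError(f"URI contains characters not allowed in a SPARQL IRI: {uri!r}")
    return f"<{uri}>"


def build_membership_query(uri: str, class_uri: str) -> str:
    """Build an ASK query that tests class membership, including subclasses."""
    return (
        "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n"
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        f"ASK {{ {safe_iri(uri)} rdf:type/rdfs:subClassOf* {safe_iri(class_uri)} . }}"
    )
```

A value produced by an LLM that tries to close the IRI early and append extra query text fails the character check, so it never reaches GraphDB.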
2. LLM Prompting
- Provide context: Include relevant ontology constraints in prompts
- Use examples: Show expected URI formats and structure
- Validate iteratively: Re-prompt if validation fails
- Lower temperature: Use consistent extraction settings
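The "validate iteratively" practice can be sketched as a small retry loop. This is a minimal illustration under stated assumptions: `extract_with_retries` is a hypothetical helper, `call_llm` and `validate` are callables you pass in (for example a wrapper around `_call_llm` and a function that builds a Pydantic model), and validation failures are assumed to surface as `ValueError`, which is the base class of Pydantic's `ValidationError`.

```python
import json
from typing import Callable, Optional


def extract_with_retries(
    call_llm: Callable[[str], str],
    validate: Callable[[dict], dict],
    prompt: str,
    max_attempts: int = 3,
) -> Optional[dict]:
    """Call the LLM, validate the result, and re-prompt with the error on failure."""
    current_prompt = prompt
    for _ in range(max_attempts):
        raw = call_llm(current_prompt)
        try:
            # json.JSONDecodeError is a ValueError subclass, so malformed JSON
            # and failed validation are handled by the same branch.
            return validate(json.loads(raw))
        except ValueError as exc:
            # Feed the error back so the model can correct its answer.
            current_prompt = (
                f"{prompt}\n\nYour previous answer was rejected: {exc}\n"
                "Return corrected JSON only."
            )
    return None  # Give up after max_attempts failed attempts
```

Keeping the original prompt and appending only the latest error keeps the prompt from growing with every attempt.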
3. Performance
- Cache queries: Store frequently used ontology queries
- Batch validation: Validate multiple items together
- Async processing: Use async for LLM calls and database queries
- Index ontologies: Ensure GraphDB has proper indices
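Two of the performance practices above, caching and concurrent validation, can be sketched with the standard library alone. The helper names `make_cached_lookup` and `validate_many` are hypothetical; `fetch` and `check` stand in for calls such as `get_valid_values` and `validate_class_membership`. The sketch assumes Python 3.9 or later for `asyncio.to_thread`.

```python
import asyncio
from functools import lru_cache
from typing import Callable, List, Tuple


def make_cached_lookup(fetch: Callable[[str], List[str]], maxsize: int = 128):
    """Wrap a property lookup so repeated calls reuse the first result."""

    @lru_cache(maxsize=maxsize)
    def cached(property_name: str) -> Tuple[str, ...]:
        # Return a tuple so callers cannot mutate the cached value.
        return tuple(fetch(property_name))

    return cached


async def validate_many(uris: List[str], check: Callable[[str], bool]) -> List[bool]:
    """Run blocking membership checks concurrently in worker threads."""
    tasks = (asyncio.to_thread(check, uri) for uri in uris)
    # gather preserves input order, so results line up with uris.
    return await asyncio.gather(*tasks)
```

Cached values go stale when the ontology changes, so a long-running service would need to clear the cache (for example with `cached.cache_clear()`) after ontology updates.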
4. Error Handling
- Graceful degradation: Fall back to partial extraction
- Detailed logging: Track validation failures for improvement
- User feedback: Allow manual correction of extracted data
- Incremental learning: Update ontology based on common failures
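Graceful degradation and detailed logging can be combined by validating list items one at a time instead of rejecting the whole extraction. The following is a minimal sketch, assuming validation errors are raised as `ValueError`; `keep_valid_items` is a hypothetical helper and `validate` is whatever per-item check you use (for example constructing a `Symptom`).

```python
import logging
from typing import Callable, Dict, List, Tuple

logger = logging.getLogger("extraction")


def keep_valid_items(
    items: List[Dict],
    validate: Callable[[Dict], Dict],
) -> Tuple[List[Dict], List[str]]:
    """Validate items individually; return the valid ones plus error messages."""
    valid: List[Dict] = []
    errors: List[str] = []
    for index, item in enumerate(items):
        try:
            valid.append(validate(item))
        except ValueError as exc:
            # Record the failure and continue so one bad item
            # does not discard the rest of the extraction.
            message = f"item {index} rejected: {exc}"
            logger.warning(message)
            errors.append(message)
    return valid, errors
```

The returned error list can be shown to the user for manual correction or aggregated to find ontology gaps that cause repeated failures.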
Integration Examples
Plant Disease Diagnosis API
from fastapi import FastAPI, HTTPException

app = FastAPI()

class DiagnosisResponse(BaseModel):
    """Extracted plant data plus the expert analysis text"""
    plant: Plant
    analysis: str

@app.post("/diagnose", response_model=DiagnosisResponse)
async def diagnose_plant(description: str):
    """API endpoint for plant disease diagnosis"""
    try:
        # Extract structured data
        plant = diagnose_plant_from_text(description)
        if not plant:
            raise HTTPException(status_code=400, detail="Could not extract plant information")
        # Route to expert analysis
        router = setup_moe_system()
        analysis = router.route_query(description, plant)
        # Plant has no `analysis` field, so return the analysis alongside the plant data
        return DiagnosisResponse(plant=plant, analysis=analysis)
    except HTTPException:
        # Preserve deliberate HTTP errors such as the 400 above
        raise
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@app.get("/ontology/diseases")
async def get_diseases():
    """Get all diseases from ontology"""
    # get_valid_values() looks up a property, so this assumes the ontology
    # defines a plant:Disease property
    diseases = ontology_validator.get_valid_values("Disease")
    return {"diseases": diseases}
Next Steps
- Setup GraphDB: Ensure GraphDB is running with plant ontology
- Install Dependencies (the examples use the Pydantic v1 API, such as @validator and regex=):
pip install "pydantic<2" openai requests fastapi
- Create Models: Start with simple Plant/Disease models
- Test Extraction: Try extracting data from text descriptions
- Add Validation: Implement ontology validation functions
- Build MOE: Create expert routing system
- Deploy API: Build FastAPI service for plant diagnosis
This integration of Pydantic models with ontologies provides a robust foundation for structured data extraction that maintains semantic consistency while leveraging the power of modern LLMs.