SafeLLM: Domain-Specific Safety Monitoring for Large Language Models

May 9

Large Language Models (LLMs) have rapidly transformed how industries approach complex decision-making, from customer service automation to high-stakes operational environments. As their deployment expands into safety-critical domains, the need for robust mechanisms to prevent unsafe outputs and hallucinations has never been more urgent. To address this, researchers at the University of Hull have introduced SafeLLM, a novel framework that combines statistical safety filtering with hallucination detection — purpose-built for domain-specific applications such as Offshore Wind (OSW) turbine maintenance. This blog post explores the SafeLLM framework and its potential to make LLM-powered tools not just capable, but genuinely trustworthy.

The Challenge of Alarm Overload in Offshore Wind Maintenance

Modern offshore wind (OSW) turbines generate enormous volumes of SCADA data, and with that comes a torrent of alarms. A single site such as the Teesside Wind Farm has recorded up to 500 alarms in a 24-hour period, roughly one alarm every three minutes. Amongst these are nuisance alarms, false positives, and chattering alarms that obscure genuine faults and delay critical maintenance responses. Operations and Maintenance (O&M) teams are increasingly overwhelmed, which makes a strong case for intelligent decision-support tools. However, deploying LLMs in such safety-critical environments introduces its own risks: hallucinated outputs, unsafe recommendations, and responses that could mislead engineers working on live equipment.

The SafeLLM Framework

SafeLLM is an explainability and safety framework developed by the School of Computer Science at the University of Hull to address the dual challenge of unsafe LLM responses and hallucinated outputs in domain-specific applications. The framework is purpose-built for the OSW maintenance sector, where LLMs are proposed as tools to interpret alarm sequences from SCADA systems and recommend appropriate repair actions. Two core components form the foundation of SafeLLM, a Statistical Safety Filter and a Hallucination Detection Layer, both underpinned by the Wasserstein distance (also known as the Earth Mover's Distance, or EMD), a robust statistical measure from optimal transport theory. The framework's main elements are:

Statistical Safety Filter: Detects unsafe LLM responses using EMD-based similarity scoring against a pre-defined dictionary of unsafe concepts.
Hallucination Detection Layer: Identifies inconsistent or hallucinated outputs by comparing multiple model generations using variance thresholding.
Cosine Similarity Benchmarking: Provides a comparative baseline against EMD to evaluate the robustness of each detection approach.
Knowledge Graph Integration: Supports ongoing LLM training as domain complexity grows over time.
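To make the safety filter concrete, here is a minimal Python sketch of how a response embedding might be scored against an Unsafe Concepts Dictionary with both measures. It is an illustration under stated assumptions, not the authors' implementation: the function names (`cosine_similarity`, `wasserstein_1d`, `flag_unsafe`), the toy four-dimensional vectors and the threshold values are all hypothetical (real Universal Sentence Encoder embeddings are 512-dimensional), and it assumes EMD is computed by treating each embedding's components as a one-dimensional distribution of values.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


def wasserstein_1d(a: Sequence[float], b: Sequence[float]) -> float:
    """1-D Earth Mover's Distance between two equal-sized, uniformly weighted samples.

    With equal sample sizes the optimal transport plan pairs the sorted values,
    so W1 reduces to the mean absolute difference of the sorted samples.
    """
    if len(a) != len(b):
        raise ValueError("this sketch assumes equal-length embeddings")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


def flag_unsafe(
    response_vec: Sequence[float],
    unsafe_dictionary: Dict[str, Sequence[float]],
    emd_threshold: float,
    cos_threshold: float,
) -> Dict[str, object]:
    """Score a response embedding against every unsafe-concept embedding.

    EMD raises an alert when the closest entry is *below* its threshold;
    cosine similarity raises one when the closest entry is *above* its threshold.
    """
    emd = {k: wasserstein_1d(response_vec, v) for k, v in unsafe_dictionary.items()}
    cos = {k: cosine_similarity(response_vec, v) for k, v in unsafe_dictionary.items()}
    emd_match = min(emd, key=emd.get)  # smallest distance = most similar
    cos_match = max(cos, key=cos.get)  # largest similarity = most similar
    return {
        "emd_match": emd_match,
        "emd_alert": emd[emd_match] < emd_threshold,
        "cos_match": cos_match,
        "cos_alert": cos[cos_match] > cos_threshold,
    }


# Hypothetical toy embeddings standing in for USE vectors of dictionary sentences.
UNSAFE_DICTIONARY = {
    "skip_lockout": [0.9, 0.1, 0.4, 0.0],
    "climb_in_high_wind": [0.0, 0.8, 0.1, 0.7],
}
response = [0.85, 0.15, 0.35, 0.05]
print(flag_unsafe(response, UNSAFE_DICTIONARY, emd_threshold=0.1, cos_threshold=0.9))
```

One consequence of the one-dimensional assumption: sorting the components discards which dimension each value came from, so two embeddings that are permutations of each other have an EMD of zero even when their cosine similarity is low. This is a property of the sketch, not a claim about the paper's exact formulation.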

In practice, SafeLLM enhances LLM safety and reliability through three interlocking stages:

Generating Synthetic Training Data: Since real-world OSW maintenance notes are often too cryptic for direct use (e.g., "WTG1 AH V MAINTENANCE"), the team used ChatGPT-4 to generate both safe and unsafe sentences across 10 defined categories, forming an Unsafe Concepts Dictionary. This serves as the reference corpus against which generated LLM outputs are compared.
Measuring Sentence Similarity: Each generated LLM response is embedded using the Universal Sentence Encoder (USE), a transformer-based model that converts text into fixed-dimensional vectors. Both Cosine Similarity and the Wasserstein (EMD) distance are then calculated between the response and each dictionary entry. Because EMD is a distance, a low EMD score indicates high similarity to an unsafe concept (the opposite direction to cosine similarity, where a high score indicates a match), triggering an alert to the O&M manager.
Hallucination Detection via Consistency Analysis: The LLM is prompted N times for each input. A deviation matrix tracks how each response diverges from a factual baseline, and a combined metric M — a weighted average of response consistency — is calculated. High variance across responses signals a likely hallucination.
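The consistency check can be sketched the same way. The following Python sketch is hypothetical throughout: it assumes the N responses and the factual baseline have already been embedded, measures deviation with cosine distance, and defines M as a weighted average (weight `alpha`) of the mean pairwise deviation among responses and their mean deviation from the baseline. The function names, `alpha`, the threshold and the toy vectors are illustrative; the paper's actual weights, distance measure and threshold may differ.

```python
import math
from statistics import mean
from typing import List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity: 0 for identical directions, up to 2 for opposite ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def deviation_matrix(responses: List[Sequence[float]]) -> List[List[float]]:
    """N x N matrix of pairwise cosine distances between the N sampled responses."""
    return [[cosine_distance(r_i, r_j) for r_j in responses] for r_i in responses]


def combined_metric(
    responses: List[Sequence[float]],
    baseline: Sequence[float],
    alpha: float = 0.5,
) -> float:
    """Hypothetical M: weighted average of inter-response and baseline deviation.

    alpha weights disagreement among the samples; (1 - alpha) weights their
    average drift from the factual baseline embedding.
    """
    n = len(responses)
    if n < 2:
        raise ValueError("need at least two sampled responses")
    d = deviation_matrix(responses)
    # Mean of the off-diagonal entries (the diagonal compares each response with itself).
    pairwise = sum(d[i][j] for i in range(n) for j in range(n) if i != j) / (n * (n - 1))
    to_baseline = mean(cosine_distance(r, baseline) for r in responses)
    return alpha * pairwise + (1.0 - alpha) * to_baseline


def is_hallucination(
    responses: List[Sequence[float]],
    baseline: Sequence[float],
    threshold: float = 0.2,
    alpha: float = 0.5,
) -> bool:
    """Flag the input as a likely hallucination when M exceeds the threshold."""
    return combined_metric(responses, baseline, alpha) > threshold


# Hypothetical toy embeddings: N = 3 samples for each of two inputs.
baseline = [1.0, 0.0, 0.0]
consistent = [[0.98, 0.05, 0.0], [0.97, 0.0, 0.05], [1.0, 0.02, 0.01]]
scattered = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(is_hallucination(consistent, baseline), is_hallucination(scattered, baseline))
```

Widely scattered samples push M up because the off-diagonal entries of the deviation matrix grow, which mirrors the idea that high variance across repeated generations signals a likely hallucination.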

Experimental Insights

The researchers evaluated SafeLLM using 10 categories of ChatGPT-4-generated sentences related to offshore wind turbine maintenance — each category containing 20 sentences (10 safe, 10 unsafe). Thresholds were fine-tuned incrementally across the range 0 to 1 to identify peak performance. Key findings include:

Cosine Similarity outperformed EMD in 7 of 10 categories, achieving the highest accuracy of 92.5% in multiple categories.
EMD delivered comparable results overall, with AUC values ranging from 0.65 (Emergency Procedures) to 0.98 (Risk Assessment), and a mean AUC of 0.78 across all categories.
Both methods showed a consistent pattern: categories that exceeded 72% accuracy with one method tended to exceed it with the other, and categories below that level tended to fall below it with both.
Confusion matrices for three categories at varying thresholds show that the choice of threshold strongly affects classification performance, suggesting room for improvement through further optimisation.
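The threshold tuning and AUC figures above can be illustrated in miniature. This Python sketch assumes a score where higher means "more likely unsafe" (for example, cosine similarity to the nearest unsafe concept; for EMD, where lower means more similar, the comparison would be flipped), labels of 1 for unsafe and 0 for safe, and a step size of 0.05, which is a hypothetical choice since the exact increment is not stated. AUC uses the rank formulation: the probability that a randomly chosen unsafe sentence outscores a randomly chosen safe one, with ties counted as half. The scores and labels are made-up toy data, not results from the study.

```python
from typing import Sequence, Tuple


def accuracy_at(scores: Sequence[float], labels: Sequence[int], threshold: float) -> float:
    """Fraction correct when score >= threshold is predicted unsafe (label 1)."""
    correct = sum((s >= threshold) == bool(y) for s, y in zip(scores, labels))
    return correct / len(labels)


def sweep_thresholds(
    scores: Sequence[float], labels: Sequence[int], step: float = 0.05
) -> Tuple[float, float]:
    """Try thresholds 0, step, 2*step, ..., 1 and return (best_threshold, best_accuracy)."""
    n_steps = round(1 / step)
    best = (0.0, -1.0)
    for k in range(n_steps + 1):
        t = k / n_steps
        acc = accuracy_at(scores, labels, t)
        if acc > best[1]:  # strict '>' keeps the first threshold reaching the best accuracy
            best = (t, acc)
    return best


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """ROC AUC as P(random unsafe score > random safe score), ties counted as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy scores for six sentences (1 = unsafe, 0 = safe).
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 0]
print(sweep_thresholds(scores, labels), auc(scores, labels))
```

Sweeping a grid like this and reading off the best accuracy is the simplest form of the incremental tuning described above; the AUC, by contrast, summarises performance across all thresholds at once.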

Applications and Implications

SafeLLM's potential extends well beyond academic demonstration. In the offshore wind industry, where turbine maintenance decisions carry direct safety consequences, the ability to detect unsafe or hallucinated LLM responses is critical. An engineer acting on a confidently stated but incorrect maintenance recommendation could place personnel at risk and incur significant operational costs. SafeLLM addresses this by adding a statistically grounded safety layer between the LLM and the end user.

Future Directions

The researchers plan to expand SafeLLM's capabilities across several dimensions:

Scale up testing with real-world OSW datasets in collaboration with industry partners, including data from EDF's Teesside Wind Farm.
Develop a comprehensive Unsafe Concepts Dictionary with more granular, industry-aligned categories, moving beyond the current general-purpose definitions.
Build an interactive Graphical User Interface (GUI) that allows O&M managers to review, verify, and provide feedback on generated responses in real time.
Explore Reinforcement Learning (RL) to enable the conversational agent to improve iteratively based on operator feedback.
Develop SafeLLM as a training aid for maintenance personnel, breaking down complex repair tasks into actionable natural language instructions.

Conclusion

SafeLLM represents a meaningful step forward in making LLMs safe and trustworthy for use in safety-critical industrial settings. By combining Wasserstein distance-based safety filtering with a multi-sample hallucination detection framework, it provides a dual-layer defence against the most pressing reliability challenges facing LLM deployment in the field. Results to date are preliminary, however, as they rely on ChatGPT-generated test data rather than real-world maintenance records.

Walker, C., Rothon, C., Aslansefat, K., Papadopoulos, Y., & Dethlefs, N. (2024). SafeLLM: A Framework for Safe and Reliable Large Language Models in Offshore Wind Maintenance. arXiv preprint arXiv:2410.10852.