This detector checks for profanity in model output. If your application is not meant to produce profanity, its appearance is a common sign that the LLM has been manipulated in some way and should be treated as a threat indicator.
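As a rough illustration, a detector of this kind can be as simple as scanning the model's response against a profanity wordlist before the response is returned. The sketch below is a minimal example under that assumption; the names `PROFANITY_TERMS` and `contains_profanity` are hypothetical and do not reflect the detector's actual implementation.

```python
import re

# Hypothetical, abbreviated wordlist; a real detector would use a much larger,
# curated lexicon or a trained classifier.
PROFANITY_TERMS = {"damn", "hell"}

def contains_profanity(text: str) -> bool:
    """Return True if any profanity term appears as a whole word in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return any(token in PROFANITY_TERMS for token in tokens)

llm_output = "Sure, here is the summary you asked for."
if contains_profanity(llm_output):
    print("Profanity detected: flagging response for review.")
```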

Example

For example, an attacker may combine several tactics to mislead the LLM into exfiltrating your PII. Along the way, one of those tactics can compromise the LLM into outputting profanity, which this detector flags.

Threat

Because existing LLMs already apply content filters, profanity in output is, more often than not, a flag that something has gone wrong. The detector is also useful if your application must refrain from profane output for compliance reasons, since it provides a way to ensure such output never reaches users.
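When compliance requires that profane output never reaches users, the detector can be wired in as an output gate. The sketch below is a minimal illustration under that assumption; `contains_profanity` is the hypothetical check from the earlier example, and `FALLBACK_MESSAGE` is an invented placeholder.

```python
FALLBACK_MESSAGE = "Sorry, I can't provide that response."

def guard_output(llm_output: str) -> str:
    """Block profane responses before they reach the user and surface the event."""
    if contains_profanity(llm_output):
        # A profanity hit is treated as a possible manipulation attempt,
        # so the raw response is withheld and the event is flagged for review.
        print("ALERT: profanity in model output; possible prompt manipulation.")
        return FALLBACK_MESSAGE
    return llm_output
```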