Skip to content

Defense elements

Detectors

PromptInjectionDetector

PromptInjectionDetector(
    mode: Literal[
        "message", "full_conversation"
    ] = "message",
    raise_on_injection: bool = False,
)

Bases: BasePipelineElement

Detects whether a prompt injection is in the tools outputs.

Parameters:

  • mode (Literal['message', 'full_conversation'], default: 'message' ) –

    The mode in which the detector should operate. It can be "message" or "full_conversation". In "message" mode, the detector is applied per tool output message, while in "full_conversation" mode, the detector is applied to the full conversation so far.

  • raise_on_injection (bool, default: False ) –

    Whether to raise a AbortAgentError if an injection is detected.

Source code in src/agentdojo/agent_pipeline/pi_detector.py
29
30
31
32
def __init__(self, mode: Literal["message", "full_conversation"] = "message", raise_on_injection: bool = False):
    super().__init__()
    self.mode = mode
    self.raise_on_injection = raise_on_injection

detect

detect(tool_output: str) -> bool | tuple[bool, float]

Returns True if a prompt injection is detected, False otherwise.

Raises:

  • AbortAgentError

    if an injection is detected and if self.raise_on_injection is True.

Returns:

  • bool | tuple[bool, float]

    If the detector is a probabilistic one, it can return a tuple with the first element being a boolean indicating whether a prompt injection was detected, and the second element being a float representing the confidence score. If the detector is not probabilistic, it should return a boolean.

Source code in src/agentdojo/agent_pipeline/pi_detector.py
34
35
36
37
38
39
40
41
42
43
44
45
46
def detect(self, tool_output: str) -> bool | tuple[bool, float]:
    """
    Returns True if a prompt injection is detected, False otherwise.

    Raises:
        AbortAgentError: if an injection is detected and if `self.raise_on_injection` is `True`.

    Returns:
        If the detector is a probabilistic one, it can return a tuple with the first element being a boolean
            indicating whether a prompt injection was detected, and the second element being a float representing
            the confidence score. If the detector is not probabilistic, it should return a boolean.
    """
    raise NotImplementedError("to use PromptInjectionDetector, use a subclass with a detect method")

TransformersBasedPIDetector

TransformersBasedPIDetector(
    model_name: str = "protectai/deberta-v3-base-prompt-injection-v2",
    safe_label: str = "SAFE",
    threshold: float = 0.5,
    mode: Literal[
        "message", "full_conversation"
    ] = "message",
    raise_on_injection: bool = False,
)

Bases: PromptInjectionDetector

Uses protectai/deberta-v3-base-prompt-injection-v2 to detect prompt injections.

Parameters:

  • model_name (str, default: 'protectai/deberta-v3-base-prompt-injection-v2' ) –

    The name of the model to use for prompt injection detection.

  • safe_label (str, default: 'SAFE' ) –

    The label that indicates a safe prompt.

  • threshold (float, default: 0.5 ) –

    The threshold for the model's prediction to be considered a prompt injection, e.g. safety_score < threshold means an input is classified as a prompt injection.

  • mode (Literal['message', 'full_conversation'], default: 'message' ) –

    The mode in which the detector should operate. It can be 'message' or 'full_conversation'.

  • raise_on_injection (bool, default: False ) –

    Whether to raise an exception if a prompt injection is detected.

Source code in src/agentdojo/agent_pipeline/pi_detector.py
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
def __init__(
    self,
    model_name: str = "protectai/deberta-v3-base-prompt-injection-v2",
    safe_label: str = "SAFE",
    threshold: float = 0.5,
    mode: Literal["message", "full_conversation"] = "message",
    raise_on_injection: bool = False,
) -> None:
    super().__init__(mode=mode, raise_on_injection=raise_on_injection)

    try:
        import torch
        from transformers import pipeline
    except ImportError:
        raise ImportError(
            "TransformersBasedPIDetector requires the 'transformers' and 'torch' libraries to be installed. "
            "Please install them using 'pip install agentdojo[transformers]'."
        )

    self.model_name = model_name
    self.safe_label = safe_label
    self.threshold = threshold
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    self.pipeline = pipeline("text-classification", model=self.model_name, device=device)