Defense elements¶
Detectors¶
PromptInjectionDetector
¶
PromptInjectionDetector(
mode: Literal[
"message", "full_conversation"
] = "message",
raise_on_injection: bool = False,
)
Bases: BasePipelineElement
Detects whether a prompt injection is in the tools outputs.
Parameters:
-
mode
(Literal['message', 'full_conversation']
, default:'message'
) –The mode in which the detector should operate. It can be "message" or "full_conversation". In "message" mode, the detector is applied per tool output message, while in "full_conversation" mode, the detector is applied to the full conversation so far.
-
raise_on_injection
(bool
, default:False
) –Whether to raise a AbortAgentError if an injection is detected.
Source code in src/agentdojo/agent_pipeline/pi_detector.py
29 30 31 32 |
|
detect
¶
Returns True if a prompt injection is detected, False otherwise.
Raises:
-
AbortAgentError
–if an injection is detected and if
self.raise_on_injection
isTrue
.
Returns:
-
bool | tuple[bool, float]
–If the detector is a probabilistic one, it can return a tuple with the first element being a boolean indicating whether a prompt injection was detected, and the second element being a float representing the confidence score. If the detector is not probabilistic, it should return a boolean.
Source code in src/agentdojo/agent_pipeline/pi_detector.py
34 35 36 37 38 39 40 41 42 43 44 45 46 |
|
TransformersBasedPIDetector
¶
TransformersBasedPIDetector(
model_name: str = "protectai/deberta-v3-base-prompt-injection-v2",
safe_label: str = "SAFE",
threshold: float = 0.5,
mode: Literal[
"message", "full_conversation"
] = "message",
raise_on_injection: bool = False,
)
Bases: PromptInjectionDetector
Uses protectai/deberta-v3-base-prompt-injection-v2
to detect prompt injections.
Parameters:
-
model_name
(str
, default:'protectai/deberta-v3-base-prompt-injection-v2'
) –The name of the model to use for prompt injection detection.
-
safe_label
(str
, default:'SAFE'
) –The label that indicates a safe prompt.
-
threshold
(float
, default:0.5
) –The threshold for the model's prediction to be considered a prompt injection, e.g. safety_score < threshold means an input is classified as a prompt injection.
-
mode
(Literal['message', 'full_conversation']
, default:'message'
) –The mode in which the detector should operate. It can be 'message' or 'full_conversation'.
-
raise_on_injection
(bool
, default:False
) –Whether to raise an exception if a prompt injection is detected.
Source code in src/agentdojo/agent_pipeline/pi_detector.py
120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 |
|