# Detection Extension

The full normative specification is at `spec/hushspec-detection.md`.
## Overview

The Detection extension configures thresholds for content analysis guards: prompt injection detection, jailbreak detection, and threat intelligence screening. This extension separates policy (thresholds, limits) from implementation (algorithms, models). The actual detection quality depends entirely on the engine.

Detection is declared under `extensions.detection` in a HushSpec document. All three subsections are independently optional. An empty `detection` object is valid and applies engine defaults for all subsections.
## Schema

```yaml
extensions:
  detection:
    prompt_injection:                # OPTIONAL
      enabled: <bool>                # Default: true
      warn_at_or_above: <level>      # Default: "suspicious"
      block_at_or_above: <level>     # Default: "high"
      max_scan_bytes: <integer>      # Default: 200000
    jailbreak:                       # OPTIONAL
      enabled: <bool>                # Default: true
      block_threshold: <integer>     # Default: 80. Range: 0-100
      warn_threshold: <integer>      # Default: 50. Range: 0-100
      max_input_bytes: <integer>     # Default: 200000
    threat_intel:                    # OPTIONAL
      enabled: <bool>                # Default: false
      pattern_db: <string>           # Path or "builtin:<name>"
      similarity_threshold: <number> # Default: 0.7. Range: 0.0-1.0
      top_k: <integer>               # Default: 5
```
## Portability Note

HushSpec defines thresholds, not algorithms. A HushSpec document with detection thresholds is portable across any engine that supports the detection extension, but the detection quality (false positive rates, evasion resistance, latency) depends entirely on the engine's implementation.

- Detection quality varies. An engine using a simple regex-based prompt injection detector will produce different results than one using a fine-tuned transformer model, even with identical threshold configuration.
- Score calibration varies. A `block_threshold` of 80 may be conservative on one engine and aggressive on another, depending on how the engine calibrates its risk scores.
- Not all engines support all subsections. An engine may support prompt injection detection but not threat intelligence screening. Engines must document which detection capabilities they support; unsupported sections are ignored.

Policy authors should test their detection thresholds against their target engine before deploying to production.
## Detection and Core Rules

Detection decisions are additive to core rules: a deny from detection cannot be overridden by a rule allow. When the detection extension produces a deny (because an input exceeds the blocking threshold), the action is denied regardless of what the core rules say. Detection therefore acts as an additional layer of defense on top of the base policy.
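As a minimal sketch of this layering, an engine might combine the two decisions as follows. The function and decision names here are illustrative assumptions, not part of the specification; in particular, the spec does not define how a detection warn interacts with a core allow.

```python
# Illustrative sketch only: names and the warn-upgrade behavior are
# assumptions, not defined by the HushSpec specification.

def final_decision(core_decision: str, detection_decision: str) -> str:
    """Combine a core-rule decision with a detection decision.

    A detection deny always wins. In this sketch a detection warn
    upgrades a core allow to warn; otherwise the core decision stands.
    """
    if detection_decision == "deny":
        return "deny"
    if detection_decision == "warn" and core_decision == "allow":
        return "warn"
    return core_decision

# A rule-level allow cannot override a detection deny:
print(final_decision("allow", "deny"))  # deny
```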
## Prompt Injection Detection

The `prompt_injection` section configures detection of prompt injection attempts in agent inputs.

### Field Reference

| Field | Type | Default | Description |
|---|---|---|---|
| `enabled` | boolean | `true` | Whether prompt injection detection is active. |
| `warn_at_or_above` | string | `"suspicious"` | Minimum severity level that produces a warning. |
| `block_at_or_above` | string | `"high"` | Minimum severity level that produces a denial. |
| `max_scan_bytes` | integer | `200000` | Maximum input size to scan, in bytes. Inputs exceeding this limit are truncated before scanning (or denied, depending on the engine). |
### Severity Scale

Detection levels form an ordered severity scale:

| Level | Ordinal | Description |
|---|---|---|
| `safe` | 0 | No injection detected. |
| `suspicious` | 1 | Possible injection, low confidence. |
| `high` | 2 | Probable injection, high confidence. |
| `critical` | 3 | Definite injection, very high confidence. |

The ordering is: `safe` < `suspicious` < `high` < `critical`.
### Threshold Semantics

When the engine's detection algorithm produces a level for a given input:

- If the level is >= `block_at_or_above`, the decision is deny.
- If the level is >= `warn_at_or_above` but < `block_at_or_above`, the decision is warn.
- Otherwise, the decision is allow.
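The comparison above can be sketched using the ordinals from the severity scale table. This is an illustrative sketch, not a normative implementation; the function name is hypothetical.

```python
# Ordinals from the severity scale table.
LEVELS = {"safe": 0, "suspicious": 1, "high": 2, "critical": 3}

def injection_decision(level: str,
                       warn_at_or_above: str = "suspicious",
                       block_at_or_above: str = "high") -> str:
    """Map a detected severity level to a decision (hypothetical helper)."""
    score = LEVELS[level]
    if score >= LEVELS[block_at_or_above]:
        return "deny"   # at or above the blocking threshold
    if score >= LEVELS[warn_at_or_above]:
        return "warn"   # at or above the warning threshold, below blocking
    return "allow"
```

With the defaults, `"high"` and `"critical"` deny, `"suspicious"` warns, and `"safe"` allows.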
## Jailbreak Detection

The `jailbreak` section configures detection of jailbreak attempts: prompts designed to bypass the model's safety training.

### Field Reference

| Field | Type | Default | Description |
|---|---|---|---|
| `enabled` | boolean | `true` | Whether jailbreak detection is active. |
| `block_threshold` | integer | `80` | Risk score (0-100) at or above which the input is denied. |
| `warn_threshold` | integer | `50` | Risk score (0-100) at or above which a warning is produced. |
| `max_input_bytes` | integer | `200000` | Maximum input size to scan, in bytes. |
### Risk Score

The risk score is an integer in the range 0 to 100 inclusive, where 0 indicates no jailbreak risk and 100 indicates maximum risk. The score is produced by the engine's detection algorithm; this specification does not prescribe how the score is computed.
### Threshold Semantics

When the engine produces a risk score for a given input:

- If the score is >= `block_threshold`, the decision is deny.
- If the score is >= `warn_threshold` but < `block_threshold`, the decision is warn.
- Otherwise, the decision is allow.
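The same comparison over the integer risk score, as a minimal sketch. The function name is hypothetical, and how the score itself is produced is engine-specific.

```python
def jailbreak_decision(score: int,
                       warn_threshold: int = 50,
                       block_threshold: int = 80) -> str:
    """Map an engine-produced risk score to a decision (hypothetical helper)."""
    if not 0 <= score <= 100:
        raise ValueError("risk score must be in the range 0-100")
    if score >= block_threshold:
        return "deny"   # at or above the blocking threshold
    if score >= warn_threshold:
        return "warn"   # at or above the warning threshold, below blocking
    return "allow"
```

Note that both thresholds are inclusive: with the defaults, a score of exactly 80 denies and a score of exactly 50 warns.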
## Threat Intelligence Screening

The `threat_intel` section configures threat intelligence pattern matching, where inputs are compared against a database of known threat patterns using similarity scoring.

### Field Reference

| Field | Type | Default | Description |
|---|---|---|---|
| `enabled` | boolean | `false` | Whether threat intelligence screening is active. |
| `pattern_db` | string | — | Path to a pattern database file, or `"builtin:<name>"` for an engine-bundled database (e.g. `"builtin:s2bench-v1"`). |
| `similarity_threshold` | number | `0.7` | Minimum similarity score (0.0-1.0) to consider a match. Lower values yield more matches (higher recall, lower precision); higher values yield fewer matches (lower recall, higher precision). |
| `top_k` | integer | `5` | Number of top matches to return in evaluation evidence. Does not affect the deny/allow decision; it only controls audit trail richness. |
### Pattern Database

The `pattern_db` field specifies the source of threat patterns:

- File path: A relative or absolute path to a JSON file containing pattern entries. Path resolution is engine-specific.
- Built-in prefix: A string starting with `"builtin:"` references an engine-bundled pattern database. Available built-in databases are engine-specific.

If `enabled` is `true` and `pattern_db` is absent, the engine should use its default pattern database if one exists, or produce a warning and treat the section as disabled.
### Decision Semantics

Threat intelligence screening produces a deny if any pattern match meets or exceeds the `similarity_threshold` (the field is defined as a minimum, so the threshold is inclusive). If no match reaches the threshold, the decision is allow. There is no intermediate warn level for threat intelligence; engines that wish to support warn-level findings may do so as an engine-specific extension.
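A hedged sketch of this decision step. It assumes the engine has already computed one similarity score per pattern; the scoring function itself is engine-specific and not defined here. The sketch treats the threshold as inclusive, matching the field's "minimum similarity score" definition.

```python
def threat_intel_decision(scores, similarity_threshold=0.7, top_k=5):
    """Hypothetical helper: deny on any match at or above the threshold.

    `top_k` only limits the matches reported as evidence; it never
    changes the deny/allow outcome.
    """
    evidence = sorted(scores, reverse=True)[:top_k]
    decision = "deny" if evidence and evidence[0] >= similarity_threshold else "allow"
    return decision, evidence
```

Because the decision depends only on the single highest score, lowering `top_k` shrinks the audit trail without loosening the screen.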
## Calibration Guidance

Because score calibration varies across engines, the following table provides recommended starting thresholds for different risk tolerances. These are starting points; always test against your target engine.

| Risk Tolerance | Prompt Injection | Jailbreak | Threat Intel |
|---|---|---|---|
| Conservative (high security, more false positives acceptable) | warn: `"safe"`, block: `"suspicious"` | warn: 30, block: 60 | similarity: 0.5 |
| Balanced (default thresholds) | warn: `"suspicious"`, block: `"high"` | warn: 50, block: 80 | similarity: 0.7 |
| Permissive (low friction, fewer false positives) | warn: `"high"`, block: `"critical"` | warn: 70, block: 90 | similarity: 0.85 |
## Examples

### Conservative Configuration (All Enabled, Low Thresholds)

Maximum detection sensitivity. Suitable for high-security environments where false positives are acceptable.

```yaml
extensions:
  detection:
    prompt_injection:
      enabled: true
      warn_at_or_above: "safe"
      block_at_or_above: "suspicious"
      max_scan_bytes: 500000
    jailbreak:
      enabled: true
      block_threshold: 60
      warn_threshold: 30
      max_input_bytes: 500000
    threat_intel:
      enabled: true
      pattern_db: "builtin:s2bench-v1"
      similarity_threshold: 0.5
      top_k: 10
```
### Balanced Configuration (Defaults)

The default thresholds. A reasonable starting point for most environments.

```yaml
extensions:
  detection:
    prompt_injection:
      enabled: true
      warn_at_or_above: "suspicious"
      block_at_or_above: "high"
      max_scan_bytes: 200000
    jailbreak:
      enabled: true
      block_threshold: 80
      warn_threshold: 50
      max_input_bytes: 200000
    threat_intel:
      enabled: true
      pattern_db: "builtin:s2bench-v1"
      similarity_threshold: 0.7
      top_k: 5
```
### Minimal Configuration (Prompt Injection Only)

Enable only prompt injection detection with engine defaults. Useful when jailbreak and threat intel are not needed or not supported by the engine.

```yaml
extensions:
  detection:
    prompt_injection:
      enabled: true
    jailbreak:
      enabled: false
    threat_intel:
      enabled: false
```
## Merge Rules

When a child document extends a base document containing detection configuration, the following merge rules apply under the `deep_merge` strategy:

| Element | Merge Behavior |
|---|---|
| Subsections | Each detection subsection (`prompt_injection`, `jailbreak`, `threat_intel`) is merged independently. Within each subsection, child fields override base fields. Base fields not specified in the child are preserved. |
| Thresholds | Child threshold values override base values. There is no min/max composition; the child's value simply replaces the base's. |

Under the `replace` strategy, the child's `detection` object entirely replaces the base's.
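The merge behavior in the table above can be sketched as a recursive dictionary merge. This is illustrative only; real engines may differ in edge cases (null handling, type conflicts) that the spec's merge rules do not cover here.

```python
def deep_merge(base: dict, child: dict) -> dict:
    """Sketch of deep_merge: child fields override, base fields are preserved."""
    out = dict(base)
    for key, value in child.items():
        if isinstance(value, dict) and isinstance(out.get(key), dict):
            out[key] = deep_merge(out[key], value)  # merge subsections recursively
        else:
            out[key] = value                        # child scalar replaces base value
    return out

base = {"detection": {"jailbreak": {"block_threshold": 80, "warn_threshold": 50}}}
child = {"detection": {"jailbreak": {"block_threshold": 60}}}
# block_threshold is overridden by the child; warn_threshold survives from the base
merged = deep_merge(base, child)
```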