Detection Extension

The full normative specification is at spec/hushspec-detection.md.

Overview

The Detection extension configures thresholds for content analysis guards: prompt injection detection, jailbreak detection, and threat intelligence screening. This extension separates policy (thresholds, limits) from implementation (algorithms, models). The actual detection quality depends entirely on the engine.

Detection is declared under extensions.detection in a HushSpec document. All three subsections are independently optional. An empty detection object is valid and applies engine defaults for all subsections.

Schema

```yaml
extensions:
  detection:
    prompt_injection:                # OPTIONAL
      enabled: <bool>                # Default: true
      warn_at_or_above: <level>      # Default: "suspicious"
      block_at_or_above: <level>     # Default: "high"
      max_scan_bytes: <integer>      # Default: 200000
    jailbreak:                       # OPTIONAL
      enabled: <bool>                # Default: true
      block_threshold: <integer>     # Default: 80. Range: 0-100
      warn_threshold: <integer>      # Default: 50. Range: 0-100
      max_input_bytes: <integer>     # Default: 200000
    threat_intel:                    # OPTIONAL
      enabled: <bool>                # Default: false
      pattern_db: <string>           # Path or "builtin:<name>"
      similarity_threshold: <number> # Default: 0.7. Range: 0.0-1.0
      top_k: <integer>               # Default: 5
```
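Since an empty detection object applies engine defaults for all subsections, an engine needs a defaulting step before evaluation. A minimal sketch in Python, where the helper name apply_detection_defaults and the DEFAULTS table are illustrative, not part of the spec (pattern_db has no documented default and is therefore omitted):

```python
# Documented defaults from the schema above. Any field or subsection the
# policy author leaves unset falls back to these values.
DEFAULTS = {
    "prompt_injection": {
        "enabled": True,
        "warn_at_or_above": "suspicious",
        "block_at_or_above": "high",
        "max_scan_bytes": 200_000,
    },
    "jailbreak": {
        "enabled": True,
        "block_threshold": 80,
        "warn_threshold": 50,
        "max_input_bytes": 200_000,
    },
    "threat_intel": {
        "enabled": False,
        "similarity_threshold": 0.7,
        "top_k": 5,
    },
}

def apply_detection_defaults(detection: dict) -> dict:
    """Fill in documented defaults for any subsection or field left unset."""
    merged = {}
    for section, defaults in DEFAULTS.items():
        # Author-supplied fields win; unspecified fields take the default.
        merged[section] = {**defaults, **detection.get(section, {})}
    return merged
```

An empty object yields all defaults; a partial object overrides only what it names.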

Portability Note

HushSpec defines thresholds, not algorithms. A HushSpec document with detection thresholds is portable across any engine that supports the detection extension, but the detection quality — false positive rates, evasion resistance, latency — depends entirely on the engine's implementation.

  • Detection quality varies. An engine using a simple regex-based prompt injection detector will produce different results than one using a fine-tuned transformer model, even with identical threshold configuration.
  • Score calibration varies. A block_threshold of 80 may be conservative on one engine and aggressive on another, depending on how the engine calibrates its risk scores.
  • Not all engines support all subsections. An engine may support prompt injection detection but not threat intelligence screening. Engines must document which detection capabilities they support; unsupported sections are ignored.

Policy authors should test their detection thresholds against their target engine before deploying to production.

Detection and Core Rules

Detection decisions are additive to core rules. A deny from detection cannot be overridden by a rule allow. When the detection extension produces a deny (because an input exceeds the blocking threshold), the action is denied regardless of what the core rules say. This means detection acts as an additional layer of defense on top of the base policy.
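One plausible reading of this precedence rule is that the stricter of the two decisions always wins; the spec guarantees this for deny, and extending the same ordering to warn is an assumption. A sketch in Python, with illustrative decision strings:

```python
# Strictness ordering for decisions. That a detection deny overrides a core
# allow is specified; ranking warn between allow and deny is an assumption.
SEVERITY = {"allow": 0, "warn": 1, "deny": 2}

def combine(core_decision: str, detection_decision: str) -> str:
    """Detection is additive: the stricter of the two decisions wins."""
    return max(core_decision, detection_decision, key=SEVERITY.get)
```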

Prompt Injection Detection

The prompt_injection section configures detection of prompt injection attempts in agent inputs.

Field Reference

| Field | Type | Default | Description |
|---|---|---|---|
| enabled | boolean | true | Whether prompt injection detection is active. |
| warn_at_or_above | string | "suspicious" | Minimum severity level that produces a warning. |
| block_at_or_above | string | "high" | Minimum severity level that produces a denial. |
| max_scan_bytes | integer | 200000 | Maximum input size to scan, in bytes. Inputs exceeding this limit are truncated before scanning (or denied, depending on the engine). |

Severity Scale

Detection levels form an ordered severity scale:

| Level | Ordinal | Description |
|---|---|---|
| safe | 0 | No injection detected. |
| suspicious | 1 | Possible injection, low confidence. |
| high | 2 | Probable injection, high confidence. |
| critical | 3 | Definite injection, very high confidence. |

The ordering is: safe < suspicious < high < critical.

Threshold Semantics

When the engine's detection algorithm produces a level for a given input:

  • If the level is >= block_at_or_above, the decision is deny.
  • If the level is >= warn_at_or_above but < block_at_or_above, the decision is warn.
  • Otherwise, the decision is allow.
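The steps above can be sketched directly against the ordinal scale from the Severity Scale table; the function name is illustrative:

```python
# Ordinals from the Severity Scale table: safe < suspicious < high < critical.
LEVEL_ORDINAL = {"safe": 0, "suspicious": 1, "high": 2, "critical": 3}

def prompt_injection_decision(level: str,
                              warn_at_or_above: str = "suspicious",
                              block_at_or_above: str = "high") -> str:
    """Map an engine-produced severity level to allow/warn/deny."""
    ordinal = LEVEL_ORDINAL[level]
    if ordinal >= LEVEL_ORDINAL[block_at_or_above]:
        return "deny"
    if ordinal >= LEVEL_ORDINAL[warn_at_or_above]:
        return "warn"
    return "allow"
```

With the defaults, safe allows, suspicious warns, and high or critical denies; the conservative settings from the calibration table shift every level one step stricter.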

Jailbreak Detection

The jailbreak section configures detection of jailbreak attempts — prompts designed to bypass the model's safety training.

Field Reference

| Field | Type | Default | Description |
|---|---|---|---|
| enabled | boolean | true | Whether jailbreak detection is active. |
| block_threshold | integer | 80 | Risk score (0-100) at or above which the input is denied. |
| warn_threshold | integer | 50 | Risk score (0-100) at or above which a warning is produced. |
| max_input_bytes | integer | 200000 | Maximum input size to scan, in bytes. |

Risk Score

The risk score is an integer in the range 0 to 100 inclusive, where 0 indicates no jailbreak risk and 100 indicates maximum risk. The score is produced by the engine's detection algorithm; this specification does not prescribe how the score is computed.

Threshold Semantics

When the engine produces a risk score for a given input:

  • If the score is >= block_threshold, the decision is deny.
  • If the score is >= warn_threshold but < block_threshold, the decision is warn.
  • Otherwise, the decision is allow.
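These threshold semantics reduce to two comparisons; a minimal sketch in Python, with an illustrative function name:

```python
def jailbreak_decision(score: int,
                       block_threshold: int = 80,
                       warn_threshold: int = 50) -> str:
    """Map an engine-produced risk score (0-100) to allow/warn/deny."""
    if score >= block_threshold:
        return "deny"
    if score >= warn_threshold:
        return "warn"
    return "allow"
```

With the defaults, a score of 79 warns while 80 denies, which is why calibration differences between engines matter at the boundary.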

Threat Intelligence Screening

The threat_intel section configures threat intelligence pattern matching, where inputs are compared against a database of known threat patterns using similarity scoring.

Field Reference

| Field | Type | Default | Description |
|---|---|---|---|
| enabled | boolean | false | Whether threat intelligence screening is active. |
| pattern_db | string | (none) | Path to pattern database file, or "builtin:<name>" for an engine-bundled database (e.g. "builtin:s2bench-v1"). |
| similarity_threshold | number | 0.7 | Minimum similarity score (0.0-1.0) to consider a match. Lower values = more matches (higher recall, lower precision). Higher values = fewer matches (lower recall, higher precision). |
| top_k | integer | 5 | Number of top matches to return in evaluation evidence. Does not affect the deny/allow decision — only controls audit trail richness. |

Pattern Database

The pattern_db field specifies the source of threat patterns:

  • File path: A relative or absolute path to a JSON file containing pattern entries. Path resolution is engine-specific.
  • Built-in prefix: A string starting with "builtin:" references an engine-bundled pattern database. Available built-in databases are engine-specific.

If enabled is true and pattern_db is absent, the engine should use its default pattern database if one exists, or produce a warning and treat the section as disabled.

Decision Semantics

Threat intelligence screening produces a deny if any pattern match has a similarity score at or above similarity_threshold (the field's minimum score to consider a match). If no match reaches the threshold, the decision is allow. There is no intermediate warn level for threat intelligence; engines that wish to support warn-level findings may do so as an engine-specific extension.
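A sketch of these decision semantics, treating a similarity at or above the threshold as a match per the field description. The matches argument is a list of (pattern_id, similarity) pairs produced by an engine-specific similarity scorer, which is not modeled here; the function name is illustrative:

```python
def threat_intel_decision(matches, similarity_threshold=0.7, top_k=5):
    """Deny if any pattern matches; top_k shapes evidence, not the decision."""
    # Rank by similarity so evidence shows the strongest matches first.
    ranked = sorted(matches, key=lambda m: m[1], reverse=True)
    evidence = ranked[:top_k]
    matched = any(sim >= similarity_threshold for _, sim in ranked)
    return ("deny" if matched else "allow"), evidence
```

Note that lowering top_k trims the audit evidence but can never flip a deny to an allow, since the decision is computed over all matches.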

Calibration Guidance

Because score calibration varies across engines, the following table provides recommended starting thresholds for different risk tolerances. These are starting points — always test against your target engine.

| Risk Tolerance | Prompt Injection | Jailbreak | Threat Intel |
|---|---|---|---|
| Conservative (high security, more false positives acceptable) | warn: "safe", block: "suspicious" | warn: 30, block: 60 | similarity: 0.5 |
| Balanced (default thresholds) | warn: "suspicious", block: "high" | warn: 50, block: 80 | similarity: 0.7 |
| Permissive (low friction, fewer false positives) | warn: "high", block: "critical" | warn: 70, block: 90 | similarity: 0.85 |

Examples

Conservative Configuration (All Enabled, Low Thresholds)

Maximum detection sensitivity. Suitable for high-security environments where false positives are acceptable.

```yaml
extensions:
  detection:
    prompt_injection:
      enabled: true
      warn_at_or_above: "safe"
      block_at_or_above: "suspicious"
      max_scan_bytes: 500000

    jailbreak:
      enabled: true
      block_threshold: 60
      warn_threshold: 30
      max_input_bytes: 500000

    threat_intel:
      enabled: true
      pattern_db: "builtin:s2bench-v1"
      similarity_threshold: 0.5
      top_k: 10
```

Balanced Configuration (Defaults)

The default thresholds. A reasonable starting point for most environments.

```yaml
extensions:
  detection:
    prompt_injection:
      enabled: true
      warn_at_or_above: "suspicious"
      block_at_or_above: "high"
      max_scan_bytes: 200000

    jailbreak:
      enabled: true
      block_threshold: 80
      warn_threshold: 50
      max_input_bytes: 200000

    threat_intel:
      enabled: true
      pattern_db: "builtin:s2bench-v1"
      similarity_threshold: 0.7
      top_k: 5
```

Minimal Configuration (Prompt Injection Only)

Enable only prompt injection detection with engine defaults. Useful when jailbreak and threat intel are not needed or not supported by the engine.

```yaml
extensions:
  detection:
    prompt_injection:
      enabled: true
    jailbreak:
      enabled: false
    threat_intel:
      enabled: false
```

Merge Rules

When a child document extends a base document containing detection configuration, the following merge rules apply under deep_merge strategy:

| Element | Merge Behavior |
|---|---|
| Subsections | Each detection subsection (prompt_injection, jailbreak, threat_intel) is merged independently. Within each subsection, child fields override base fields. Base fields not specified in the child are preserved. |
| Thresholds | Child threshold values override base values. There is no min/max composition — the child's value simply replaces the base's. |

Under replace strategy, the child's detection object entirely replaces the base's.
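Both strategies can be sketched in a few lines of Python, treating the base and child detection objects as plain dicts; the function name is illustrative:

```python
def merge_detection(base: dict, child: dict, strategy: str = "deep_merge") -> dict:
    """Merge a child detection object over a base per the rules above."""
    if strategy == "replace":
        # Child entirely replaces the base's detection object.
        return dict(child)
    merged = {}
    # deep_merge: each subsection merges independently; within a subsection,
    # child fields override and unspecified base fields are preserved.
    for section in base.keys() | child.keys():
        merged[section] = {**base.get(section, {}), **child.get(section, {})}
    return merged
```

For example, a child that sets only jailbreak.block_threshold inherits the base's warn_threshold under deep_merge, but loses it entirely under replace.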