FEAT: Add PromptDecompositionConverter (DrAttack decompose-and-reconstruct) #2003

Open: Raulster24 wants to merge 1 commit into microsoft:main from Raulster24:raulster24/add-prompt-decomposition-converter

@Raulster24

Contributor

Description

This adds a PromptDecompositionConverter implementing the decompose-and-reconstruct technique from DrAttack (Li et al., Findings of EMNLP 2024, https://arxiv.org/abs/2402.16914). Following the discussion with @rlundeen and @romanlutz, it is implemented as a converter rather than an attack class, so it stays composable with the existing engines (TAP, Crescendo, PromptSendingAttack) instead of duplicating their traversal logic.

The converter:

  • Decomposes the objective into a flat, role-tagged phrase structure ({"words": [...], "types": [...]}) using an LLM. The flat form was chosen over a nested parse tree because it is much easier to validate.
  • Rebuilds it as a "Question A / Question B" reconstruction task plus a static benign in-context demonstration, so the target reassembles the original intent itself and the assembled instruction never appears verbatim in the request.
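
As a rough illustration of why the flat form is easy to validate, the sketch below checks a `{"words": [...], "types": [...]}` payload with a few shape checks. The class name `FlatDecomposition` and the tag set in `ALLOWED_TYPES` are assumptions made for this example; the converter's actual names and tags may differ.

```python
from dataclasses import dataclass

# Assumed tag set, for illustration only; the converter's real tags may differ.
ALLOWED_TYPES = frozenset({"instruction", "structure", "verb", "noun"})


@dataclass(frozen=True)
class FlatDecomposition:
    """Hypothetical container for the flat, role-tagged phrase structure."""

    words: tuple[str, ...]
    types: tuple[str, ...]

    @classmethod
    def from_dict(cls, data: dict) -> "FlatDecomposition":
        words = data.get("words")
        types = data.get("types")
        if not isinstance(words, list) or not isinstance(types, list):
            raise ValueError("'words' and 'types' must both be lists")
        if not all(isinstance(item, str) for item in words + types):
            raise ValueError("every phrase and tag must be a string")
        # Parallel lists: one tag per phrase.
        if len(words) != len(types):
            raise ValueError(f"{len(words)} phrases but {len(types)} tags")
        unknown = sorted(set(types) - ALLOWED_TYPES)
        if unknown:
            raise ValueError(f"unknown tags: {unknown}")
        return cls(tuple(words), tuple(types))

    def joined(self) -> str:
        """Join the phrases back into a single string."""
        return " ".join(self.words)


# A benign objective, decomposed by hand.
example = FlatDecomposition.from_dict(
    {
        "words": ["Write", "a short poem", "about", "autumn leaves"],
        "types": ["instruction", "noun", "structure", "noun"],
    }
)
```

With two parallel lists, validation needs no recursion: a length comparison and a set difference cover the shape, which a nested parse tree would not allow.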

The decomposition is the one step the paper precomputes offline. Here it runs live at conversion time, so that path is hardened:

  • Structured output validated against a reconstruction-recall invariant (the joined phrases must preserve the original tokens), plus valid-tag and opening-instruction checks.
  • On a validation failure the error is fed back to the model and the call is retried.
  • A deterministic spaCy part-of-speech fallback runs if every attempt fails, so the converter never hard-fails on valid input (spaCy is already a PyRIT dependency).
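
The hardening steps above can be sketched as a small validate, retry, and fall-back loop. This is a minimal sketch, not the converter's code: `recall_error`, `decompose_with_retry`, and `whitespace_fallback` are hypothetical names, the recall check is one plausible reading of the invariant (a case-insensitive token comparison), and a whitespace split stands in for the spaCy part-of-speech fallback so the example has no dependencies.

```python
from collections import Counter
from typing import Callable, Optional

# (objective, feedback from the previous attempt or None) -> candidate payload
Decomposer = Callable[[str, Optional[str]], dict]


def recall_error(objective: str, words: list[str]) -> Optional[str]:
    """Return an error message if the joined phrases drop original tokens."""
    wanted = Counter(objective.lower().split())
    got = Counter(" ".join(words).lower().split())
    missing = wanted - got  # tokens present in the objective but lost
    if missing:
        return "missing tokens: " + ", ".join(sorted(missing))
    return None


def whitespace_fallback(objective: str) -> dict:
    """Deterministic stand-in for the spaCy fallback (assumption)."""
    tokens = objective.split()
    return {"words": tokens, "types": ["unknown"] * len(tokens)}


def decompose_with_retry(
    objective: str,
    ask_model: Decomposer,
    fallback: Callable[[str], dict],
    max_attempts: int = 3,
) -> dict:
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        candidate = ask_model(objective, feedback)
        feedback = recall_error(objective, candidate.get("words", []))
        if feedback is None:
            return candidate  # valid: every original token is preserved
    # Every attempt failed validation, so use the deterministic path.
    return fallback(objective)
```

Passing the previous error back into `ask_model` lets the model correct a specific omission instead of retrying blind, and the fallback guarantees a result for any valid input string.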

Live-decomposition reliability: on 30 AdvBench objectives with gpt-4o-mini, 93% of decompositions parsed as valid, and every valid parse reconstructed the original exactly.

Scope and follow-ups: this PR is the core converter. The word-game variant and registering the technique in scenario_techniques.py are intended as follow-ups. The catalog registration has an open design question worth input: create() resolves LLM targets lazily only for the adversarial-chat slot, not for converters in attack_converter_config, so wiring a target-needing converter into the static catalog needs a decision (reuse the adversarial target, the objective target, or add a lazy-converter slot). Happy to take that on separately.

Tests and Documentation

  • Added tests/unit/prompt_converter/test_prompt_decomposition_converter.py (7 tests): reconstruction assembly, retry-with-error-feedback, deterministic fallback, no-fallback-raises, reconstruction-recall rejection, invalid input type, and identifier construction.
  • Documented in doc/code/converters/1_text_to_text_converters.py under LLM-based converters, and added the DrAttack reference to doc/references.bib.
  • Ran jupytext --sync on doc/code/converters/1_text_to_text_converters.py so the paired notebook is updated.
  • ruff check and ruff format clean; ty reports no errors on the converter.
