Conversation
fdubut
left a comment
There was a problem hiding this comment.
Added a few comments. The main one is the question mark around running multi-turn attacks on single-turn jailbreaks (which all of them are in the main template directory). It would be worth testing what it looks like and if that even makes sense.
|
|
||
| @classmethod | ||
| def get_all_jailbreak_templates(cls, n: Optional[int] = None) -> List[str]: | ||
| def get_all_jailbreak_templates(cls, k: Optional[int] = None) -> List[str]: |
There was a problem hiding this comment.
Not sure if this function is called directly anywhere and if it would be a breaking change, but it would make sense to me to rename it get_jailbreak_templates since by definition you are not always returning them all.
| # Strategies for tweaking jailbreak efficacy through attack patterns | ||
| ManyShot = ("many_shot", {"single_turn"}) | ||
| PromptSending = ("prompt_sending", {"single_turn"}) | ||
| Crescendo = ("crescendo", {"multi_turn"}) |
There was a problem hiding this comment.
Did you try running these jailbreaks with multi-turn attacks? Most of the jailbreaks in PyRIT are single-turn jailbreaks, I'm not even quite sure how they would look like with a multi-turn attack...
There was a problem hiding this comment.
After tweaking it on my end, it's definitely less consistent than I'd like, so I think I'm removing multi-turn attacks from the scenario unless anyone objects. Especially with RedTeamingAttack, the adversarial model doesn't always seem to "get" the role it plays in the jailbreak, and as you said it doesn't seem useful in line with the existing jailbreaks we have in PyRIT.
| Strategy for jailbreak attacks. | ||
| """ | ||
|
|
||
| # Aggregate members (special markers that expand to strategies with matching tags) |
There was a problem hiding this comment.
What's the default strategy? Some other scenarios have a get_default_strategy function but I don't see one here. I would recommend to make the default PromptSending because it's the one that makes the most sense for the jailbreaks we have in PyRIT so far.
There was a problem hiding this comment.
Right now it's JailbreakStrategy.ALL, but I like that idea more, so I'm going to change it to PromptSending. Should be line 84
| scenario_result_id: Optional[str] = None, | ||
| n_jailbreaks: Optional[int] = 3, | ||
| k: Optional[int] = None, | ||
| n: int = 1, |
There was a problem hiding this comment.
k and n are not quite self-explanatory... how about we keep n_jailbreaks and add n_attempts as something that's clearer in the code?
There was a problem hiding this comment.
(btw I do agree with the new default value of None which means all jailbreaks)
There was a problem hiding this comment.
Agreed that it's unclear, will change
| n_jailbreaks (Optional[int]): Choose n random jailbreaks rather than using all of them. | ||
| k (Optional[int]): Choose k random jailbreaks rather than using all of them. | ||
| n (Optional[int]): Number of times to try each jailbreak. | ||
| jailbreaks (Optional[int]): Dedicated list of jailbreaks to run. |
There was a problem hiding this comment.
- Should be
Optional[List[str]] - Can we clarify in the docstring (or wherever else you think is appropriate) that these are the names of the jailbreaks from our template list in PyRIT? At first glance I thought these are custom jailbreaks, and the strings were the full text of the jailbreaks.
There was a problem hiding this comment.
Good catch! And changed in latest commit
| jailbreaks (List[str]): List of jailbreak names. | ||
|
|
||
| Raises: | ||
| ValueError: If jailbreaks not discovered. |
There was a problem hiding this comment.
Would recommend to make it a bit more explicit (e.g. "if at least one provided jailbreak does not exist").
There was a problem hiding this comment.
Changed in latest commit
| List[AtomicAttack]: List of atomic attacks to execute, one per jailbreak template. | ||
|
|
||
| Raises: | ||
| ValueError: If self._jailbreaks is not a subset of all jailbreak templates. |
There was a problem hiding this comment.
Technically it's true but if I understand correctly, it would be thrown by the TextJailbreakConverter initializer at this point, and it doesn't seem to be caught anywhere, so do we need to mention it here? Also we've already done that validation as part of _validate_jailbreaks_subset so I'm not sure it's worth re-mentioning it here.
There was a problem hiding this comment.
Changed in latest commit. This raises section shouldn't be here
There was a problem hiding this comment.
A bit of a random question, if we are now allowing n_attempts > 1, is it worth also relaxing max_dataset_size=4?
There was a problem hiding this comment.
Good point. I've removed it, although I'm wondering if we should allow this to be accessible to the user?
Description
Adding more features to the Jailbreak scenario! Major changes:
JailbreakStrategynow supports multiple different attack types viaManyShot,PromptSending,Crescendo, andRedTeamingvalues.SINGLE_TURNandMULTI_TURNaggregates;PYRIThas been deprecated.k,n, andjailbreaks; these allow you to choose a random number of jailbreaks, how many times to try each jailbreak, and to choose which jailbreaks specifically you'd like to use respectively. Note thatkandjailbreaksare mutually exclusive.Tests and Documentation