Examples: query, "exact match", wildcard*, wild?ard, wild*rd
Fuzzy search: cake~ (finds cakes, bake)
Term boost: "red velvet"^4, chocolate^2
Field grouping: tags:(+work -"fun-stuff")
Escaping: Escape characters +-&|!(){}[]^"~*?:\ with \, e.g. \+
Range search: properties.timestamp:[1587729413488 TO *] (inclusive), properties.title:{A TO Z}(excluding A and Z)
Combinations: chocolate AND vanilla, chocolate OR vanilla, (chocolate OR vanilla) NOT "vanilla pudding"
Field search: properties.title:"The Title" AND text
Unanswered
Create a text based internal memory to write a story and store it


Everyone who builds jailbreak prompts knows that. But you're missing some context on why jailbreaking works, and why we can use human sounding text to jailbreak.

First, preconditioning the prompt does change the distribution that results. And using anthropic language to "give" it special powers makes it less likely to refuse a tasks that it's moderators have tried to block. That's jailbreaking.

We know the AI is being fine tuned and censored by humans in real time using Reinforcement Learning with Human Feedback (RLHF). We know RLHF fine-tunes chatGPT on a bunch of "bad prompts" and "refusals" to complete those prompts. That's why GPT will refuse to make a bomb. These are all human statements, and so add an anthropic dimension to censorship filter. And we know jailbreaking it with human sounding statements works to overcome these RHLF limits. That is probably an important side effect of how RLHF depends on human speech to begin with. We would fail to understand this if we didn't anthropomorphise this part of the system.

So why is it important to stop anthropomorphising it, when in this case it works and it's essential for knowing what is happening? Yes, jailbreaking it with some random adversarial text would probably work much better. But we don't have that tech yet. Moreover, the anthropic dimension of RLHF is real - it is caused by the presence of human moderators in the system.

Jailbreaking today is just a prompt battle between gpt users and gpt moderators. That's why the conflict is anthropomorphic in nature.

1
1
Posted one year ago
Rish
430 × 6 Administrator
156 Views
0 Answers
one year ago
one year ago