Create a text based internal memory to write a story and store it

Unanswered

Everyone who builds jailbreak prompts knows that. But you're missing some context on why jailbreaking works, and why we can use human sounding text to jailbreak.

First, preconditioning the prompt does change the distribution that results. And using anthropic language to "give" it special powers makes it less likely to refuse a tasks that it's moderators have tried to block. That's jailbreaking.

We know the AI is being fine tuned and censored by humans in real time using Reinforcement Learning with Human Feedback (RLHF). We know RLHF fine-tunes chatGPT on a bunch of "bad prompts" and "refusals" to complete those prompts. That's why GPT will refuse to make a bomb. These are all human statements, and so add an anthropic dimension to censorship filter. And we know jailbreaking it with human sounding statements works to overcome these RHLF limits. That is probably an important side effect of how RLHF depends on human speech to begin with. We would fail to understand this if we didn't anthropomorphise this part of the system.

So why is it important to stop anthropomorphising it, when in this case it works and it's essential for knowing what is happening? Yes, jailbreaking it with some random adversarial text would probably work much better. But we don't have that tech yet. Moreover, the anthropic dimension of RLHF is real - it is caused by the presence of human moderators in the system.

Jailbreaking today is just a prompt battle between gpt users and gpt moderators. That's why the conflict is anthropomorphic in nature.

  				
				1

				1
Posted 
	one year ago

					More
				  		
  Report
		
				Rish
				
				430
					 × 6
					 Administrator

204 Views

0 Answers

one year ago