Skip to content

Malbolge part

Ivan Zolotarev edited this page Apr 8, 2017 · 6 revisions

Malbolge is a unique programming language. It became widely known thanks to the series "Elementary" as a language in which only the most skillful or stoned hackers can write. What is the most interesting, it's not so far from the truth. Since its creating in 1998, in it's written well, if a couple dozens of programs, and those mostly created not manually, but generated.

Cause? Read the specification, read the interpreter and choose any one to taste:

  • the code simultaneously acts as data for its execution
  • the result of the execution of the instruction depends not only on its purpose, but also on the position in the source code
  • in the language there are no conditional branch operations completely, which makes it impossible to arbitrarily move around the program source code
  • after executing the instruction, it is encrypted, which makes it extremely difficult to reuse it
  • And other small joys like working with ternary numbers.

Please see the links above, since I will not explain the basics of the language, and the further solution is entirely based on this foundation. I only note that where the specification and its implementation in the interpreter (from the author himself, Ben Olmsted) differ, the interpreter is considered correct. For example, the character is printed with the <, and not /.

Look at the Hello world source code in Wiki. It's creepy, isn't it? In appearance, a completely chaotic conglomeration of symbols, formed incomprehensibly by what rules. In fact, it looks so because obfuscated. The unobfuscated version (also called the normalized source code) consists of only a set of language instructions (j,i,*,p,<,/,v,o), and possibly also whitespace characters, which will be ignored by the interpreter.

So, our goal is to print Malbolge program.

As it follows from the specification, for the output of the symbol for printing there corresponds the instruction <, which interprets the value in the register A (accumulator), as the code of the ASCII-symbol. ASCII is an 8-bit encoding, that is, it contains 256 characters. If value in A is greater than 256, the symbol code is the remainder of dividing it by the table size: [A] mod 256.

Our task is to form the necessary value. In the arsenal of the language, there are 2 instructions that affect the value in A: * and p. Let's take a closer look at them.

The algorithm of the instruction *:

  1. Convert the value in memory at address D into a 10-bit ternary number. Suppose that the source code at position D (and since when you load the program into memory it is mapped directly, then at address D in memory) is the symbol ~. We look at the ASCII table, find out that the symbol code is 126. Convert to a ternary number: 11200. Adding zeros on the left to 10 digits: 0000011200.
  2. Rotates the digits in a number 1 to the right, i.e., the first becomes the second, the second becomes the third, ... the latter becomes the first. 0000011200 => 0000001120.
  3. Write the resulting value to the memory location at address D, replacing the old value with it. Then the new value from D is copied to register A. If we wanted to display the contents of A on the screen, what would we get? Convert 0000001120 to the decimal number: 0000001120 => 42 (the value is less than 256, so you do not need to take anything modulo). The symbol #42 in ASCII is *. It would have been printed.

The algorithm of the instruction p is exactly the same, with one exception: instead of the result of a cyclic shift in D, the result of applying the Crazy ternary operation proposed by the author to [D] and [A] is recorded.

It would seem that everything is simple and the algorithm for the generation of any required value is in our pocket, but it was not there. Two of the three registers, C and D, act as pointers: the first refers to the current instruction, the second to the current position with the data. And when the program is loaded into memory, both registers are initialized to zero. That is, when the program is running, any instruction immediately acts data for its execution.

Let's consider a small example. Let it be the simplest program:

*<v

or in an unnormalized form

'bO

Let's follow the first instruction:

  1. ASCII-code ' is 39. We translate it to a 10-bit ternary number: 0000001110.
  2. Cyclic shift to the right: 0000000111.
  3. Convert back to decimal: 13. This is a non-graphical character, so the < instruction won't print anything on the screen.

But it's not important. The conclusion is important: we can not "slip" the data we would like to the instruction *. They are rigidly dependent on the instruction itself and its location in the source code. Well, since we can not influence the instruction itself, we have to change its location. Yes, in this way it's really possible to output the required text, but the program will grow to insane sizes (and do not forget about the restriction on the size of the source code of 59049 characters). Here is the shortest example that prints a character *:

ooooooo*<v

or in an unnormalized form

DCBA@?>~[H

But is it possible to avoid this problem at all? True, we can. Not for nothing that there are instructions j and i in the syntax of the language. The first sets as a pointer to the data [D], the second uses the same value as a pointer to the code. That is, they both perform the role of jump, delimiting the areas of code and data.

If to show clearly, here is the structure of the code using j:

...j[code]...[data]
   \_________^

The same, with i:

...i[data]...[code]
   \_________^

The question arises: which option should we choose in our case?

Let's try the first option. Just write a few instructions j in sequence and see if we can far "jump" from any of the positions:

jjjjjjjjjjjjjjj

in the unnormalized form:

('&%$#"!~}|{zyx

ASCII code of ( equal to 40, so we can jump to this position if the first instruction is j. The ASCII codes of the following characters all decrease until they reach ! with code 33 in the eighth position. If the instruction j is located at the next, ninth position, the position #126 (ASCII code of the character ~) will act as the "destination".

That is, either under our code, which should form the Malbolge program phrase, there will be a maximum of 40 instructions (there will not be enough - believe me, I tried :), or something in the area of 126 (and that's too much free space for code of output).

So I stopped on the second option. Having done the same estimations, it's easy to find out that the maximum we can jump to 98 position (ASCII-code of the symbol b - the normalized representation of the command i, standing on the first position). The further from the beginning the instruction i is located, the closer we will jump.

In the end, our code will look something like this:

...i[{instructions set}...{instructions set}]...[{set of *,p,o}<...<{set of *,p,o}<v]

o - nop-instruction, in fact doing nothing. However, in this case it can be useful: it may turn out that the length of a set of *, p and o outputting the required symbol will be shorter than the length of the set consisting only of * and p.

The last thing left is to choose the appropriate sets of instructions. I won't describe the search algorithm in detail (I will probably do this later) - it is quite simple and reduces to sequential search of the *, p and o instructions in the code block and in general all instructions in the corresponding data block. Of course, I did not seek the solution manually, but I wrote for this the generator in Mathematica (sorry, guys, I was amused as best I could). The code is really terrible, but please make a discount at the time of writing - it was a few years ago :)

In the end, I got the following result:

ooooooi/iojpo*pivojji/oovvoipooooo/oo*o<o/oo/iojo/oo/o/joijoo/ooooooooooooooooooooooooooooooo*p<*p<*ppp<*ppp<opp<*p<pp<pp<o*<*op<o*p<*op<*p<*p<pp<*p<v

Later, some instructions were replaced, because they made the source code of the program in another language incorrect. Plus, 2 empty instructions were added almost at the very end - I don't remember the exact reason, perhaps it was an attempt to add Befunge (it did not take off :). Summary view:

ooooooi/iojpo*pivojji/ijvvoipooooo/ji*p<o/oo/iojo/oo/o/joijpo/oopop<p<*p*o*<*ppo**p<pp*<oppoo*p<*p<*ppp<*ppp<opp<*p<pp<pp<o*<*op<**p<*op<*p<*p<pp<*poo<v

Or in a more visual representation, where you can see which part of the source is responsible for what:

ooooooi         jump
,_____/
|   /io         data for 'M'
|   jpo         data for 'a'
|   *pivo       data for 'l'
|   jji/i       data for 'b'
|   jvvo        data for 'o'
|   ipo         data for 'l'
|   ooo         data for 'g'
|   o/j         data for 'e'
|   i*p         data for ' '
|   <o/o        data for 'p'
|   o/io        data for 'r'
|   jo/o        data for 'o'
|   o/o         data for 'g'
|   /jo         data for 'r'
|   ijp         data for 'a'
|   o/oop       data for 'm'
|   op<p<*p     do nothing (some instructions between data & code)
|   *o*<*pp     --//--
|   o**p<pp     --//--
v   *<oppoo     --//--
*p<             print 'M'
*p<             print 'a'
*ppp<           print 'l'
*ppp<           print 'b'
opp<            print 'o'
*p<             print 'l'
pp<             print 'g'
pp<             print 'e'
o*<             print ' '
*op<            print 'p'
**p<            print 'r'
*op<            print 'o'
*p<             print 'g'
*p<             print 'r'
pp<             print 'a'
*poo<           print 'm'
v               stop

The same code is in an unnormalized form:

DCBA@?\nZ;|38x0SA3tsN`Lo98*G"'&%$#Sc>`v<zLxwI5tWrDpoAm?Oj)Laf8dc\aZ~X|?U=Y;v9ONS54JnHG/jJCBGF(>b%;_"876Z{321U5.-Qr*N('K%$H(hEf${Abaw=^zs9Zp6Wm3kj0Qglk+v

Or more visually (don't try to run this code in the interpreter - Malbolge doesn't tolerate the comments, considering them part of their code):

DCBA@?\     jump
nZ;         data for 'M'
|38         data for 'a'
x0SA3       data for 'l'
tsN`L       data for 'b'
o98*        data for 'o'
G"'         data for 'l'
&%$         data for 'g'
#Sc         data for 'e'
>`v         data for ' '
<zLx        data for 'p'
wI5t        data for 'r'
WrDp        data for 'o'
oAm         data for 'g'
?Oj         data for 'r'
)La         data for 'a'
f8dc\       data for 'm'
aZ~X|?U     do nothing (some instructions between data & code)
=Y;v9ON     --//--
S54JnHG     --//--
/jJCBGF     --//--
(>b         print 'M'
%;_         print 'a'
"876Z       print 'l'
{321U       print 'b'
5.-Q        print 'o'
r*N         print 'l'
('K         print 'g'
%$H         print 'e'
(hE         print ' '
f${A        print 'p'
baw=        print 'r'
^zs9        print 'o'
Zp6         print 'g'
Wm3         print 'r'
kj0         print 'a'
Qglk+       print 'm'
v           stop

Almost all of these sources you can check, for example, in Matthias Ernst's debugger. Very handy tool, with switching to normalized mode and back.

So, the first - and, perhaps, the most difficult part - is completed. Go ahead.

Clone this wiki locally