Join Strings with a Delimiter

This function joins Unicode strings into one large Unicode string, using a specified Unicode string as the delimiter.

Unicode strings occupy 16 bits per character.

The strings in question are stored in a special way. They are not terminated with a Unicode '\0' character. Instead, the word (2 bytes) just before the address of the string stores the length of the string. The length is the number of characters, not the number of bytes.
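To make the layout concrete, here is a minimal sketch in C (chosen over assembly only so it is easy to compile and run). The helper names make_counted and counted_length are hypothetical and are not part of the original program. The sketch assumes the caller supplies storage with room for the length word plus the characters.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: lay out a length-prefixed string in `storage`.
   storage[0] holds the length in characters, the characters start at
   storage[1], and the returned pointer is the string's "address". */
static uint16_t *make_counted(uint16_t *storage, const char *ascii)
{
    size_t n = strlen(ascii);
    assert(n <= 0xFFFF);                   /* length must fit in one word */
    storage[0] = (uint16_t)n;              /* length in characters, not bytes */
    for (size_t i = 0; i < n; i++)
        storage[1 + i] = (uint16_t)(unsigned char)ascii[i];
    return storage + 1;                    /* address of the first character */
}

/* Read the length from the word just before the string's address,
   the same word the assembly reaches with [esi-2]. */
static uint16_t counted_length(const uint16_t *s)
{
    return s[-1];
}
```

For the string "test", counted_length returns 4, even though the characters occupy 8 bytes.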

This function assumes that the first argument on the stack is the address of the output buffer and the second is the address of the delimiter. After that come the strings to join; there can be a variable number of these, with a NULL address indicating the end of the sequence.

The program is responsible for popping the arguments off the stack, but not for preserving registers that it uses. It can assume that the output buffer is large enough to hold the result.

Remember that rep movsw moves ecx words from [esi] to [edi]. Because movsw increments esi and edi each time it is executed, at the end of the rep movsw command, esi and edi point just past the end of the source and destination strings.
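The behavior of rep movsw can be modeled in a few lines of C. This is only a sketch: the Regs structure and the rep_movsw function are hypothetical stand-ins for the real registers and instruction, and the model assumes the direction flag is clear, which is the case where esi and edi are incremented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register file holding only the three registers involved. */
typedef struct {
    const uint16_t *esi;  /* source pointer */
    uint16_t       *edi;  /* destination pointer */
    size_t          ecx;  /* remaining word count */
} Regs;

/* Model of `rep movsw` with the direction flag clear: copy ecx words,
   advancing esi and edi by one word each time, until ecx reaches zero. */
static void rep_movsw(Regs *r)
{
    assert(r->esi != NULL && r->edi != NULL);
    while (r->ecx != 0) {
        *r->edi++ = *r->esi++;
        r->ecx--;
    }
}
```

After the call, esi and edi both point just past the words that were copied and ecx is zero. If ecx is zero on entry, nothing is copied and neither pointer moves.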

Source Code


 1.      dest      dd ?  ; store output buffer address here
 2.      delimiter dd ?  ; store delimiter address here
 3.
 4.  join_string:
 5.      pop edi       ; output buffer
 6.      mov word ptr [edi-2], 0  ; length is 0 to begin
 7.      mov dest, edi ; save it in a variable
 8.      pop eax       ; delimiter
 9.      mov delimiter, eax  ; save it in a variable
10.
11.      mov ecx, 0    ; cx holds various lengths; ensure
12.                   ; the high 16 bits of ecx are zero
13.
14.  nextstring:
15.      ; loop back here to take a string off the stack
16.      pop esi
17.      cmp esi, 0    ; NULL string, we are done
18.      je done
19.
20.      ; if this is the first string, don't add delimiter
21.      mov edx, dest
22.      cmp word ptr [edx-2], 0  ; length is 0, first string
23.      je copystring
24.
25.      ; not first string, have to copy the delimiter
26.      push esi      ; save this (next string address)
27.      mov esi, delimiter
28.      mov cx, word ptr [esi-2]  ; length of delimiter
29.      add word ptr [edx-2], cx  ; add to length of buffer
30.      rep movsw     ; copy delimiter
31.
32.  copystring:
33.      mov cx, word ptr [esi-2]  ; length of string
34.      add word ptr [edx-2], cx  ; add to length of buffer
35.      rep movsw     ; copy string
36.      jmp nextstring
37.
38.  done:

Suggestions

Describe the meaning of edi as it is used throughout the procedure. Is it used correctly given this meaning?
When the procedure exits, has it properly pulled all the arguments off the stack?
The length of the output buffer is not updated after the done label. This means that it must be kept updated during the procedure. Check that it will always match the number of characters in the buffer.
The procedure deals with Unicode strings whose length is specified in characters, not bytes. Make sure that all string lengths reflect this rule.

Hints

Walk through the function with the following values on the stack at the beginning. (The current stack location is at the bottom. Each input is written as a literal string, which means that the address of a Unicode string with that value is on the stack, with the length properly stored in the word just before the location the address points to. The output buffer is shown as an empty string, meaning its length is 0, but it can be assumed to have enough room to store the result string.)

No strings to join:

    NULL
    "+" [the delimiter]
    "" [the output buffer]

Only one string:

    NULL
    "test"
    "-" [the delimiter]
    "" [the output buffer]

Two strings, two-character delimiter:

    NULL
    "words"
    "two"
    "**" [the delimiter]
    "" [the output buffer]

Three strings, empty delimiter:

    NULL
    "c"
    "b"
    "a"
    "" [the delimiter, empty in this case]
    "" [the output buffer]

Explanation of the Bug

Line 26 saves esi, which holds the address of the next string, on the stack

push esi      ; save this (next string address)

so that esi can be reused in lines 27-30 as the source address for the delimiter copy, as required by movsw. However, the saved value is never popped back off the stack.

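This is an F.missing error. Whenever the delimiter has to be copied, the string address pushed at line 26 is still on the stack when execution returns to line 16, which pops it again as though it were the next argument. The program therefore loops forever on the same string. Because esi is also left pointing into the delimiter rather than at the string when line 33 runs, each pass appends the delimiter and whatever follows it in memory instead of the string itself, usually until the program crashes accessing memory it is not allowed to access.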

A line of code needs to be added at line 31:


pop esi       ; restore this (next string address)

The only time this bug does not happen is if the delimiter is never copied, which means that zero or one strings are passed in to be joined.
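As a cross-check on the corrected routine, the following C sketch models what the fixed procedure is meant to compute. It is not a translation of the author's code: the names append_counted and join_strings are hypothetical, and the args array stands in for the stack, listing the arguments in the order they are popped (output buffer, delimiter, strings, then NULL). It keeps the same first-string test as line 22 of the listing, namely that the output length is still 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Append the length-prefixed string `src` at `out`; return the new end. */
static uint16_t *append_counted(uint16_t *out, const uint16_t *src)
{
    uint16_t n = src[-1];                /* length in characters */
    for (uint16_t i = 0; i < n; i++)
        *out++ = src[i];
    return out;
}

/* Model of the corrected join. `args` plays the role of the stack, in
   pop order: output buffer, delimiter, strings..., NULL. */
static void join_strings(uint16_t *const *args)
{
    uint16_t *dest = *args++;            /* pop edi: output buffer */
    const uint16_t *delim = *args++;     /* pop eax: delimiter */
    uint16_t *edi = dest;                /* current write position */
    const uint16_t *s;

    assert(dest != NULL && delim != NULL);
    dest[-1] = 0;                        /* length is 0 to begin */

    while ((s = *args++) != NULL) {      /* pop esi; NULL ends the list */
        if (dest[-1] != 0) {             /* same first-string test as line 22 */
            /* `s` lives in its own variable, which does the job of the
               push esi / pop esi pair around the delimiter copy. */
            edi = append_counted(edi, delim);
            dest[-1] = (uint16_t)(dest[-1] + delim[-1]);
        }
        edi = append_counted(edi, s);
        dest[-1] = (uint16_t)(dest[-1] + s[-1]);
    }
}
```

With the arguments from the "two strings, two-character delimiter" hint, the output buffer ends up holding "two**words" with a length word of 10, and every argument including the NULL has been consumed.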
