Check if Two Words Are Anagrams

This function checks if two words are anagrams (which means that they have the same letters, but possibly in a different order). It ignores case differences in letters.

The program counts the letters in a string by allocating a 256-byte array and storing, in index n of the array, the number of times that the letter with an ASCII value of n appears (with an adjustment for case insensitivity-all letters are upper-cased first). This means that the program can't handle more than 256 of any one letter-that is not the bug. After the letter totals for the two words are counted, they are compared to see if they match, in which case the two words are anagrams.

The main procedure is called check_anagram. It calls a procedure do_count to count the letters in a single word. The address of the two words (strings terminated with a '\0' character) are pushed on the stack as parameters to check_anagram.

The procedures are not responsible for popping their parameters off the stack. (When check_anagram calls do_count, it is the responsibility of check_anagram, as the caller, to clean up the stack.) Because do_count is only called by check_anagram and, at a point where no registers need to be saved, do_count does not preserve passed-in registers, but check_anagram does.

The program uses the following string instructions:

stosd. Store eax into dword ptr [edi]
scasb. Compare al to byte ptr [edi]
lodsb. Load byte ptr [esi] into al
cmpsd. Compare dword ptr [esi] to dword ptr [edi]

Keep in mind how the rep prefix works on each iteration. If ecx is non-zero, it performs the primitive string instruction, decrements ecx without changing flags, and then (for cmps and scas) checks the zero flag result from the primitive operation.

The following instruction


sete al

sets al to 1 if the zero flag is 1; otherwise, it sets al to 0. (There is an entire collection of setXX instructions that match up with the jXX status-flag-based conditional jump instructions: setne, setl, and so on. They all set a byte to 1 if the condition is true, and 0 otherwise.)

Source Code


 1.      ; helper procedure called by check_anagram

 2.  do_count:

 3.

 4.      ; do_count expects word, then count array

 5.      ; to be pushed

 6.      mov edi, [esp + 4]   ; count array

 7.      mov ecx, 64

 8.      mov eax, 0

 9.      rep stosd          ; clear count array

10.

11.      ; find the length of word

12.      mov edi, [esp + 8]   ; word

13.      mov ecx, -1  ; start it at -1 so it won't cause

14.                   ; the repne scasb to end

15.      repne scasb  ; terminates when [edi] points to a 0

16.                   ; ecx will be -(length of word + 2)

17.      not ecx      ; ecx is now (length of word + 2)

18.      sub ecx, 2   ; ecx is now length of word

19.

20.      jecxz countdone  ; if length is zero, skip word

21.

22.      mov edi, [esp + 4]  ; count array

23.      mov esi, [esp + 8]  ; word

24.

25.  startcount:

26.      lodsb        ; put next character in al

27.      cmp al, 'a'  ; if less than 'a'...

28.      jl dontupper

29.      cmp al, 'z'  ; ...or greater than 'z'

30.      jg dontupper ; ...don't uppercase it

31.      sub al, 'a'

32.      add al, 'A'

33.

34.  dontupper:

35.      inc byte ptr [edi+eax]   ; update count array

36.      loop startcount

37.  countdone:

38.      ret

39.

40.      ; the main procedure

41.  check_anagram:

42.

43.      push ebp

44.      mov ebp, esp

45.      push ebx

46.      push ecx

47.      push esi

48.      push edi

49.      sub esp, 512  ; allocate count arrays

50.

51.      mov ebx, [ebp + 8]    ; first word

52.      push ebx

53.      lea ebx, [ebp - 528]

54.      push ebx

55.      call do_count

56.      add esp, 8            ; clean parameters off stack

57.

58.      mov ebx, [ebp + 12]   ; second word

59.      push ebx

60.      lea ebx, [ebp - 272]

61.      push ebx

62.      call do_count

63.      add esp, 8            ; clean parameters off stack

64.

65.      mov eax, 0

66.      lea esi, [ebp - 528]

67.      lea edi, [ebp - 272]

68.      mov ecx, 64

69.      rep cmpsd

70.

71.      sete al    ; eax will be 1 if it matched, else 0

72.

73.      add esp, 512  ; remove local variables

74.      pop edi

75.      pop esi

76.      pop ecx

77.      pop ebx

78.      pop ebp

79.      ret

Suggestions

It's best to focus on the do_count procedure first because it is called by check_anagram. Think of a trivial input to do_count, as well as an input that exercises all the code paths.
The statement on line 35 is indexing into the count array that was passed as a parameter to do_count. What ensures that this will not index past the end of the array?
Check that ecx is used correctly in all the rep instructions.
Check that the count arrays allocated as local variables to check_anagram on line 49 are correctly addressed.

Hints

Walk through the code with the following parameters to check_anagram:

Anagram, case differs:
"a"
"A"
Anagram, more than one character:
"123"
"321"
Not an anagram, one string is blank:
""
"hello"
Not an anagram, differ only in count of one letter:
"abca"
"abc"

Explanation of the Bug

The calculation of the length of the word on lines 13-18 is incorrect:


mov ecx, -1  ; start it at -1 so it won't cause

             ; the repne scasb to end

repne scasb  ; terminates when [edi] points to a 0

             ; ecx will be -(length of word + 2)

not ecx      ; ecx is now (length of word + 2)

sub ecx, 2   ; ecx is now length of word

After the repne scasb, ecx will indeed be -(length of word + 2), as the comment states. The easiest word to check this against is an empty one, where byte ptr [edi] is 0 the first time the scasb instruction on line 15 executes. In this case, based on the rules of how the repne prefix works, it still decrements ecx once, so after line 16 ecx ends up as -2 for a zero-length string.

The bug is on line 17. The not instruction inverts every bit in ecx, which is not the same as negating it. To negate a number in two's complement, you invert every bit and then add one. Since the "add one" step is missing, ecx winds up as one less than it should be.

This could be classified as an A.off-by-one or a D.limit error because the behavior is that the code skips counting the last character in the word-unless the string is zero-length, in which case, ecx winds up as -1 and the startcount loop on lines 23-34 iterates (almost) forever (until it accesses bad memory).

The fix is to change line 17 to use the correct opcode to negate ecx, as follows:


neg ecx      ; ecx is now (length of word + 2)

Table of Contents