Previous Section Table of Contents Next Section

Check if Parentheses Match in Source Code

This program checks if parentheses match up in a buffer that contains source code.

The source code is in a language similar to C. It supports the following:

  • Comments that start with /* and end with */, and can span multiple lines.

  • Comments that start with //, and run to the end of the line.

  • Strings that are delimited with double quote (") characters.

In addition, the following rules apply:

  • Inside a /* */ comment, // and " are ignored.

  • Inside a // comment, /*, */, and " are ignored.

  • Inside a string, comment delimiters are ignored.

  • Inside a string, the two-character sequence \" indicates an escaped double quote, which means that the string contains a double quote character and does not terminate.

  • A newline ends a string constant.

The program uses a state machine. The states are identified by a number:

  • 0. Regular text [looking for / or "]

  • 1. In state 0, hit / [looking for / or *]

  • 2. In state 0, hit " [now in " string, looking for \ or " or \n]

  • 3. In state 1, hit * [now in /* comment, looking for *]

  • 4. In state 3, hit * [in /* comment, looking for /]

  • 5. In state 1, hit / [now in // comment, looking for \n]

  • 6. In state 2, hit \ [in " string, looking for "]

The program special cases the characters *, /, ", \, and newline ('\n') when they appear in the buffer. For each character, the program uses a table of which state to move to next, based on the current state:


*  = "0324352"

/  = "1523052"

"  = "2003352"

\  = "0063352"

\n = "0003300"


For example, this means that if the program encounters a * while in state 1, it should move to state 3. This corresponds to having seen a / as the previous character and now seeing a *, which puts the state machine into the "in /* comment" state.

There is also a special table for what to do if the program encounters any other character-in some cases, the state stays unchanged; in other cases, it won't:


other = "0023352"


For example, if in state 1 (just saw a /, looking for * or /) and any other character is seen, the program should revert back to state 0, normal text. But in state 2 (inside a " string), the state should remain unchanged unless one of the special characters is seen.

The program has to only track parentheses (for the purpose of determining if they are matched up) when in states 0 or 1.

On entry, the variable textdata holds a pointer to the buffer, which is terminated with a '\0' character. The variable chars points to a '\0'-terminate string that contains the special characters, in order: *, /, ", \, newline (this string would be written in C, with appropriate escape characters, as "*/\"\\\n").

In addition, the six variables startable, slashtable, doublequotetable, backslashtable, newlinetable, and othertable point to '\0'-terminate strings containing the state transition tables previously shown.

Remember precisely how repne scasb works. For each iteration, it does the following:

  • Exit if ecx is 0.

  • Perform the scasb (which sets the flags based on al - byte ptr [edi]).

  • Decrement ecx without changing the flags.

  • Exit if the zero flag is 1.

The program is a code fragment, not a procedure. It is not responsible for preserving registers. It should return the result in eax: 1 if the parentheses match, 0 if they don't.

Source Code


 1.      mov esi, textdata  ; esi holds current pointer

 2.      mov ecx, 0         ; ecx holds the paren depth

 3.      mov ebx, 0         ; ebx holds the current state

 4.

 5.      sub esp, 24        ; allocate room for tables

 6.      mov eax, startable

 7.      mov [esp], eax

 8.      mov eax, slashtable

 9.      mov [esp+4], eax

10.      mov eax, doublequotetable

11.      mov [esp+8], eax

12.      mov eax, backslashtable

13.      mov [esp+12], eax

14.      mov eax, newlinetable

15.      mov [esp+16], eax

16.      mov eax, othertable

17.      mov [esp+20], eax

18.

19.  nextchar:

20.      cmp byte ptr [esi], 0  ; is next character '\0'?

21.      je done

22.

23.      cmp ebx, 1         ; if state is not 0 or 1

24.      jg lookforspecial  ; don't worry about ( or )

25.      cmp byte ptr [esi], '('

26.      jne checkforright

27.      inc ecx            ; increase paren depth

28.      xor ebx, ebx       ; go back to state 0

29.      jmp skiplookup     ; it's a (, so not special

30.

31.  checkforright:

32.      cmp byte ptr [esi], ')'

33.      jne lookforspecial

34.      dec ecx            ; decrease paren depth

35.      xor ebx, ebx       ; go back to state 0

36.      jmp skiplookup     ; it's a ), so not special

37.

38.  lookforspecial:

39.      push ecx           ; save this temporarily

40.      mov edi, chars     ; chars is '*','/','"','\\','\n'

41.      mov al, [esi]      ; load next character

42.      mov ecx, 6

43.      repne scasb

44.

45.      mov eax, 5

46.      sub eax, ecx       ; ecx is index in chars that

47.                         ; matched, or 5 if no match

48.      shl eax, 2         ; multiply by 4

49.      mov edi, [esp+4]   ; +4 because ecx was pushed

50.      add edi, eax       ; index into tables

51.      mov eax, [edi]     ; eax is now address of table

52.

53.      mov cl, [eax+ebx]  ; cl is table[current state]

54.      sub cl, '0'        ; convert digit to number

55.

56.      mov bl, cl         ; update current state

57.

58.      pop ecx            ; retrieve the paren depth

59.

60.  skiplookup:

61.      inc esi            ; next character in buffer

62.      jmp nextchar

63.

64.  done:

65.      jecxz matched      ; jump if ecx is 0

66.      mov eax, 0         ; ecx non-zero, didn't match up

67.      jmp end

68.  matched:

69.      mov eax, 1         ; ecx is zero, matched up

70.  end:

71.      add esp, 24        ; remove local storage


Suggestions

  1. There are really two parts to checking this program: ensuring that the state machine transition tables are correct and ensuring that the program processes them correctly. Are the state machine transition tables correct?

  2. Check that esi and ebx hold their meanings properly throughout the program.

  3. Walk through the instructions on lines 42-47 carefully to verify that the comment on lines 46-47 is correct.

  4. Is the return value calculated correctly?

Hints

It is easiest to test the program on small inputs that check some particular aspect of the state machine, rather than try a full program all at once. The strings are shown without escape characters or double quotes to make them clearer:

  1. Make sure basic parentheses matching works:

    abc((def)ghi)

  2. Check /* comments:

    ( /* )" */ )

  3. Check // comments and too many left parentheses:

    (abc // def

  4. Check strings and too many right parentheses:

    ( ")\")"123))

Explanation of the Bug

Line 49 is incorrect:


mov edi, [esp+4]   ; +4 because ecx was pushed


The code tries to set edi to be the base address of the array of state-machine transition tables that is stored in local variables on the stack. However, this statement actually puts the address of the first table in edi. What it should be doing is putting the address of the entire array in edi. Thus, it should read as follows:


lea edi, [esp+4]   ; +4 because ecx was pushed


This is a B.expression error. There is an extra dereference of the "pointer" on the stack. The effect is that the program, instead of looking up a table in an array of tables on line 51, looks up a table in a string, which results in a crash.

    Previous Section Table of Contents Next Section