Check if Parentheses Match in Source Code
This program checks if parentheses match up in a buffer that contains source code.
The source code is in a language similar to C. It supports the following:
Comments that start with /* and end with */, and can span multiple lines. Comments that start with //, and run to the end of the line. Strings that are delimited with double quote (") characters.
In addition, the following rules apply:
Inside a /* */ comment, // and " are ignored. Inside a // comment, /*, */, and " are ignored. Inside a string, comment delimiters are ignored. Inside a string, the two-character sequence \" indicates an escaped double quote, which means that the string contains a double quote character and does not terminate. A newline ends a string constant.
The program uses a state machine. The states are identified by a number:
0. Regular text [looking for / or "] 1. In state 0, hit / [looking for / or *] 2. In state 0, hit " [now in " string, looking for \ or " or \n] 3. In state 1, hit * [now in /* comment, looking for *] 4. In state 3, hit * [in /* comment, looking for /] 5. In state 1, hit / [now in // comment, looking for \n] 6. In state 2, hit \ [in " string, looking for "]
The program special cases the characters *, /, ", \, and newline ('\n') when they appear in the buffer. For each character, the program uses a table of which state to move to next, based on the current state:
* = "0324352"
/ = "1523052"
" = "2003352"
\ = "0063352"
\n = "0003300"
For example, this means that if the program encounters a * while in state 1, it should move to state 3. This corresponds to having seen a / as the previous character and now seeing a *, which puts the state machine into the "in /* comment" state.
There is also a special table for what to do if the program encounters any other character-in some cases, the state stays unchanged; in other cases, it won't:
other = "0023352"
For example, if in state 1 (just saw a /, looking for * or /) and any other character is seen, the program should revert back to state 0, normal text. But in state 2 (inside a " string), the state should remain unchanged unless one of the special characters is seen.
The program has to only track parentheses (for the purpose of determining if they are matched up) when in states 0 or 1.
On entry, the variable textdata holds a pointer to the buffer, which is terminated with a '\0' character. The variable chars points to a '\0'-terminate string that contains the special characters, in order: *, /, ", \, newline (this string would be written in C, with appropriate escape characters, as "*/\"\\\n").
In addition, the six variables startable, slashtable, doublequotetable, backslashtable, newlinetable, and othertable point to '\0'-terminate strings containing the state transition tables previously shown.
Remember precisely how repne scasb works. For each iteration, it does the following:
Exit if ecx is 0. Perform the scasb (which sets the flags based on al - byte ptr [edi]). Decrement ecx without changing the flags. Exit if the zero flag is 1.
The program is a code fragment, not a procedure. It is not responsible for preserving registers. It should return the result in eax: 1 if the parentheses match, 0 if they don't.
Source Code
1. mov esi, textdata ; esi holds current pointer
2. mov ecx, 0 ; ecx holds the paren depth
3. mov ebx, 0 ; ebx holds the current state
4.
5. sub esp, 24 ; allocate room for tables
6. mov eax, startable
7. mov [esp], eax
8. mov eax, slashtable
9. mov [esp+4], eax
10. mov eax, doublequotetable
11. mov [esp+8], eax
12. mov eax, backslashtable
13. mov [esp+12], eax
14. mov eax, newlinetable
15. mov [esp+16], eax
16. mov eax, othertable
17. mov [esp+20], eax
18.
19. nextchar:
20. cmp byte ptr [esi], 0 ; is next character '\0'?
21. je done
22.
23. cmp ebx, 1 ; if state is not 0 or 1
24. jg lookforspecial ; don't worry about ( or )
25. cmp byte ptr [esi], '('
26. jne checkforright
27. inc ecx ; increase paren depth
28. xor ebx, ebx ; go back to state 0
29. jmp skiplookup ; it's a (, so not special
30.
31. checkforright:
32. cmp byte ptr [esi], ')'
33. jne lookforspecial
34. dec ecx ; decrease paren depth
35. xor ebx, ebx ; go back to state 0
36. jmp skiplookup ; it's a ), so not special
37.
38. lookforspecial:
39. push ecx ; save this temporarily
40. mov edi, chars ; chars is '*','/','"','\\','\n'
41. mov al, [esi] ; load next character
42. mov ecx, 6
43. repne scasb
44.
45. mov eax, 5
46. sub eax, ecx ; ecx is index in chars that
47. ; matched, or 5 if no match
48. shl eax, 2 ; multiply by 4
49. mov edi, [esp+4] ; +4 because ecx was pushed
50. add edi, eax ; index into tables
51. mov eax, [edi] ; eax is now address of table
52.
53. mov cl, [eax+ebx] ; cl is table[current state]
54. sub cl, '0' ; convert digit to number
55.
56. mov bl, cl ; update current state
57.
58. pop ecx ; retrieve the paren depth
59.
60. skiplookup:
61. inc esi ; next character in buffer
62. jmp nextchar
63.
64. done:
65. jecxz matched ; jump if ecx is 0
66. mov eax, 0 ; ecx non-zero, didn't match up
67. jmp end
68. matched:
69. mov eax, 1 ; ecx is zero, matched up
70. end:
71. add esp, 24 ; remove local storage
Suggestions
There are really two parts to checking this program: ensuring that the state machine transition tables are correct and ensuring that the program processes them correctly. Are the state machine transition tables correct? Check that esi and ebx hold their meanings properly throughout the program. Walk through the instructions on lines 42-47 carefully to verify that the comment on lines 46-47 is correct. Is the return value calculated correctly?
Hints
It is easiest to test the program on small inputs that check some particular aspect of the state machine, rather than try a full program all at once. The strings are shown without escape characters or double quotes to make them clearer:
Make sure basic parentheses matching works: abc((def)ghi) Check /* comments: ( /* )" */ ) Check // comments and too many left parentheses: (abc // def Check strings and too many right parentheses: ( ")\")"123))
Explanation of the Bug
Line 49 is incorrect:
mov edi, [esp+4] ; +4 because ecx was pushed
The code tries to set edi to be the base address of the array of state-machine transition tables that is stored in local variables on the stack. However, this statement actually puts the address of the first table in edi. What it should be doing is putting the address of the entire array in edi. Thus, it should read as follows:
lea edi, [esp+4] ; +4 because ecx was pushed
This is a B.expression error. There is an extra dereference of the "pointer" on the stack. The effect is that the program, instead of looking up a table in an array of tables on line 51, looks up a table in a string, which results in a crash.
|