Previous Section Table of Contents Next Section

Kanji Backspace

This function backspaces within text stored using the encoding scheme known as double-byte character sets (DBCS). Although other writing systems are encoded using DBCS, this problem has acquired the name Kanji Backspace.

A text string encoded in DBCS contains a mixture of single-byte and double-byte characters. Single-byte characters are the ASCII characters that many programmers are familiar with; a capital 'A' is represented by the single byte 65. Double-byte characters occupy 2 bytes in the encoding.

To tell them apart, single characters in DBCS are allowed to use only 7 bits, with the high bit off. In a double-byte character, the first byte has the high bit on, and the second byte can be any value.

Note

We will refer to bytes that have the eighth bit on as high bytes and ones that do not have it on as low bytes, which are abbreviated as H and L. So, a double-byte character made up of a high byte followed by a low byte is called an HL character.


Thus, moving forward through a DBCS string, it is easy to determine character boundaries: If you encounter a low byte, it is a single-byte character; if you encounter a high-byte, it is the first byte of a double-byte character; and then repeat. However, it is tricky to backspace because a low byte could be either a single-byte character or the second byte of a double-byte character.

The Kanji Backspace algorithm is based on the observation that a low byte always precedes a character boundary. That is, the byte after a low byte is always the beginning of a new character. Using this information, the code can scan backward to find a known character boundary.

The function takes the start of the string and the current location in the string as parameters. It needs the string's start because sometimes you have to walk all the way back to the start of the string to figure out the character boundaries. The function returns a pointer to the previous character.

Source Code


 1.     char *

 2.     kanji_backspace(

 3.         char * string_start,

 4.         char * current) {

 5.

 6.         char * location, * return_value;

 7.         int distance;

 8.

 9.         if ((*(current-1)) & 0x80) { // byte before current?

10.             // H byte, must be second byte of HH character

11.             return_value = current - 2;

12.         } else {

13.             // L byte, so find a previous reliable character

14.             // boundary. The algorithm is to scan back until

15.             // we find another L byte, or hit string_start.

16.             location = current-2;

17.             while (location >= string_start) {

18.                 if (((*location) & 0x80) == 0) {

19.                     break;

20.                 }

21.                 --location;

22.             }

23.

24.             // The byte right after location is the start of

25.             // a character. See how far it is from current.

26.             distance = (current - (location+1));

27.             if ((distance % 2) == 0) {

28.                 // series of HH chars followed by HL char

29.                 return_value = current-2;

30.             } else {

31.                 // series of HH chars followed by L char

32.                 return_value = current-1;

33.             }

34.         }

35.

36.         // Check to make sure we have not moved back before

37.         // string_start. We compare with <= because we are

38.         // going back one character.

39.         if (return_value <= string_start) {

40.             return NULL;

41.         } else {

42.             return return_value;

43.         }

44.     }


Suggestions

  1. What are the trivial and empty inputs for this function?

  2. What inputs are required to ensure full code coverage?

  3. Restate the if statements on lines 9 and 18 in English.

  4. What is the goal of while on line 17? What is true at line 23?

Hints

Walk through the code with the following parameters to the function:

  1. Test the case where the preceding character is a single-byte character. A buffer test_string is set up as follows:

    test_string[0] == 0x9f test_string[1] == 0x68 test_string[2] == 0x86 test_string[3] == 0xa0 test_string[4] == 0x34 test_string[5] == 0x65

    That is, it contains an HL character, an HH character, and an L character.

    The function is called as follows:

    kanji_backspace(test_string, test_string+5)

  2. Try the case where the preceding character is a double-byte character and the first character in the string. A buffer test_string is set up the same way:

    test_string[0] == 0x9f test_string[1] == 0x68 test_string[2] == 0x86 test_string[3] == 0xa0 test_string[4] == 0x34 test_string[5] == 0x65

    The function is called as follows:

    kanji_backspace(test_string, test_string+2)

Explanation of the Bug

The check performed at the end, on line 39, to make sure that you did not step back before the beginning of the string,


if (return_value <= string_start) {


is incorrect. It is acceptable for return_value to point to string_start, if the function was called with the parameter current pointing to the second character in the string. This is an A.off-by-one bug; the check should be as follows:


if (return_value < string_start) {


The comment preceding the test


// Check to make sure we have not moved back before

// string_start. We compare with <= because we are

// going back one character.


is a red herring; although the function is indeed going back one character, that has nothing to do with whether this check should use <= or <.

With the situation in the first hint, the byte before current is a low byte (0x34). The if on line 9 is false and the code runs through the else case starting on line 12. The while loop starting on line 17 never finds another L character, so at line 23, after the while loop, location winds up as string_start-1. Because current was passed in as string_start+5, the calculation on line 26 sets distance to be 5; therefore, the string is as described in the comment on line 31 and return_value is set properly on line 32. The incorrect test on line 39 doesn't cause an error because return_value is greater than string_start.

With the second hint, the byte before current is an H byte, so if on line 9 is true and return_value is set correctly on line 11. However, return_value happens to be equal to string_start, so the incorrect check on line 39 is true and the function winds up incorrectly returning NULL on line 40.

    Previous Section Table of Contents Next Section