Friday, November 13, 2009

12.4. Characters and Strings













This section provides some tips for using strings. The first applies to strings in all languages.



Cross-Reference





Issues for using magic characters and strings are similar to those for magic numbers discussed in Section 12.1, "Numbers in General."




Avoid magic characters and strings Magic characters are literal characters (such as 'A') and magic strings are literal strings (such as "Gigamatic Accounting Program") that appear throughout a program. If you program in a language that supports the use of named constants, use them instead. Otherwise, use global variables. Several reasons for avoiding literal strings exist:



  • For commonly occurring strings like the name of your program, command names, report titles, and so on, you might at some point need to change the string's contents. For example, "Gigamatic Accounting Program" might change to "New and Improved! Gigamatic Accounting Program" for a later version.

  • International markets are becoming increasingly important, and it's easier to translate strings that are grouped in a string resource file than it is to translate them in situ throughout a program.

  • String literals tend to take up a lot of space. They're used for menus, messages, help screens, entry forms, and so on. If you have too many, they grow beyond control and cause memory problems. String space isn't a concern in many environments, but in embedded systems programming and other applications in which storage space is at a premium, solutions to string-space problems are easier to implement if the strings are relatively independent of the source code.

  • Character and string literals are cryptic. Comments or named constants clarify your intentions. In the next example, the meaning of 0x1B isn't clear. The use of the ESCAPE constant makes the meaning more obvious.





C++ Examples of Comparisons Using Strings

    if ( input_char == 0x1B ) ...     <-- 1
    if ( input_char == ESCAPE ) ...   <-- 2

(1) Bad!
(2) Better!
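As a minimal sketch of the named-constant approach in C, the fragment below gathers a magic character and a magic string in one place. Only ESCAPE and the program title come from the text; PROGRAM_NAME, is_escape(), and format_title() are hypothetical names introduced for illustration.

```c
#include <stdio.h>

/* ESCAPE replaces the magic character 0x1B (the ASCII escape code). */
enum { ESCAPE = 0x1B };

/* Hypothetical constant: the program title is defined exactly once. */
static const char PROGRAM_NAME[] = "Gigamatic Accounting Program";

/* Returns 1 if input_char is the escape character, 0 otherwise. */
int is_escape(int input_char)
{
    return input_char == ESCAPE;
}

/* Writes "<program name>: <report>" into buffer (at most size bytes,
   including the terminator) and returns snprintf()'s result. */
int format_title(char *buffer, size_t size, const char *report)
{
    return snprintf(buffer, size, "%s: %s", PROGRAM_NAME, report);
}
```

If the title later changes, only the definition of PROGRAM_NAME needs to be edited; every caller of format_title() picks up the new text.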





Watch for off-by-one errors Because substrings can be indexed much as arrays are, watch for off-by-one errors that read or write past the end of a string.
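One way to guard against such errors is to check every index against both the source length and the destination size before copying. The routine below is a sketch under that assumption; copy_substring() and its parameter names are hypothetical and not part of any standard library.

```c
#include <stddef.h>
#include <string.h>

/* Copies up to count characters of source, starting at index start,
   into dest. dest_size is the full size of dest in bytes, including
   room for the null terminator. Returns the number of characters copied. */
size_t copy_substring(char *dest, size_t dest_size,
                      const char *source, size_t start, size_t count)
{
    size_t source_length = strlen(source);
    size_t i = 0;

    if (dest_size == 0)
        return 0;

    /* Valid source indexes are 0 .. source_length - 1, and the last
       byte of dest is reserved for the terminator. */
    while (i < count && start + i < source_length && i < dest_size - 1) {
        dest[i] = source[start + i];
        i++;
    }
    dest[i] = '\0';
    return i;
}
```

The loop uses strict less-than comparisons throughout, so it never reads source[source_length] or writes dest[dest_size].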



cc2e.com/1285



Know how your language and environment support Unicode In some languages such as Java, all strings are Unicode. In others such as C and C++, handling Unicode strings requires its own set of functions. Conversion between Unicode and other character sets is often required for communication with standard and third-party libraries. If some strings won't be in Unicode (for example, in C or C++), decide early on whether to use the Unicode character set at all. If you decide to use Unicode strings, decide where and when to use them.
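To illustrate why Unicode needs its own routines in C, the sketch below assumes the strings are stored as UTF-8 (an assumption; the text doesn't name an encoding). In that case strlen() reports bytes rather than characters, so counting characters takes a separate function. The name utf8_code_point_count() is hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Counts the code points in a valid, null-terminated UTF-8 string by
   skipping continuation bytes, which always have the form 10xxxxxx. */
size_t utf8_code_point_count(const char *text)
{
    size_t count = 0;

    for (; *text != '\0'; text++) {
        if (((unsigned char)*text & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For the UTF-8 bytes of "café" (the final letter is encoded as the two bytes 0xC3 0xA9), strlen() returns 5 while utf8_code_point_count() returns 4.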



Decide on an internationalization/localization strategy early in the lifetime of a program Internationalization and localization are major issues. Key considerations are whether to store all strings in an external resource, and whether to create separate builds for each language or to determine the specific language at run time.



cc2e.com/1292



If you know you only need to support a single alphabetic language, consider using an ISO 8859 character set For applications that need to support only a single alphabetic language (such as English) and that don't need to support multiple languages or an ideographic language (such as written Chinese), the ISO 8859 extended-ASCII-type standard makes a good alternative to Unicode.



If you need to support multiple languages, use Unicode Unicode provides more comprehensive support for international character sets than ISO 8859 or other standards.



Decide on a consistent conversion strategy among string types If you use multiple string types, one common approach that helps keep the string types distinct is to keep all strings in a single format within the program and convert the strings to other formats as close as possible to input and output operations.
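As one possible illustration of converting at the boundary, suppose a program keeps every string in ISO 8859-1 internally and must emit UTF-8 on output. Both the choice of formats and the name latin1_to_utf8() are assumptions made for this sketch; the conversion would be called only in the output layer, so the rest of the code sees a single string format.

```c
#include <stddef.h>

/* Converts a null-terminated ISO 8859-1 string to UTF-8.
   Returns the number of bytes written, excluding the terminator,
   or (size_t)-1 if dest is too small to hold the result. */
size_t latin1_to_utf8(char *dest, size_t dest_size, const char *source)
{
    const unsigned char *in = (const unsigned char *)source;
    size_t out = 0;

    for (; *in != 0; in++) {
        /* Bytes below 0x80 are plain ASCII; the rest need two bytes. */
        size_t needed = (*in < 0x80) ? 1 : 2;

        if (out + needed + 1 > dest_size)   /* +1 reserves the terminator */
            return (size_t)-1;
        if (*in < 0x80) {
            dest[out++] = (char)*in;
        } else {
            dest[out++] = (char)(0xC0 | (*in >> 6));
            dest[out++] = (char)(0x80 | (*in & 0x3F));
        }
    }
    if (out + 1 > dest_size)
        return (size_t)-1;
    dest[out] = '\0';
    return out;
}
```

Because each ISO 8859-1 byte expands to at most two UTF-8 bytes, a destination of twice the source length plus one is always large enough.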





Strings in C



C++'s standard template library string class has eliminated most of the traditional problems with strings in C. For those programmers working directly with C strings, here are some ways to avoid common pitfalls:



Be aware of the difference between string pointers and character arrays The problem with string pointers and character arrays arises because of the way C handles strings. Be alert to the difference between them in two ways:



  • Be suspicious of any expression containing a string that involves an equal sign. String operations in C are nearly always done with strcmp(), strcpy(), strlen(), and related routines. Equal signs often imply some kind of pointer error. In C, assignments do not copy string literals to a string variable. Suppose you have a statement like

    StringPtr = "Some Text String";



    In this case, "Some Text String" is a pointer to a literal text string and the assignment merely sets the pointer StringPtr to point to the text string. The assignment does not copy the contents to StringPtr.

  • Use a naming convention to indicate whether the variables are arrays of characters or pointers to strings. One common convention is to use ps as a prefix to indicate a pointer to a string and ach as a prefix for an array of characters. Although they're not always wrong, you should regard expressions involving both ps and ach prefixes with suspicion.
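The two points above can be sketched together as follows. The ps and ach prefixes follow the convention just described; NAME_LENGTH, same_text(), set_name(), and assignment_shares_storage() are hypothetical names chosen for this example.

```c
#include <string.h>

enum { NAME_LENGTH = 31 };   /* hypothetical maximum name length */

/* Compares two strings by content; psLeft == psRight would compare
   only the addresses. */
int same_text(const char *psLeft, const char *psRight)
{
    return strcmp(psLeft, psRight) == 0;
}

/* Copies the characters into the caller's array instead of merely
   re-pointing a pointer. achName must hold NAME_LENGTH + 1 bytes. */
void set_name(char achName[NAME_LENGTH + 1], const char *psSource)
{
    strncpy(achName, psSource, NAME_LENGTH);
    achName[NAME_LENGTH] = '\0';
}

/* Returns 1: the assignment copies the pointer, not the characters,
   so both variables refer to the same literal. */
int assignment_shares_storage(void)
{
    const char *psTitle = "Some Text String";
    const char *psAlias = psTitle;

    return psAlias == psTitle;
}
```

Two arrays filled by set_name() with the same text occupy different storage, so only same_text() reports them as equal.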



Declare C-style strings to have length CONSTANT+1 In C and C++, off-by-one errors with C-style strings are common because it's easy to forget that a string of length n requires n + 1 bytes of storage and to forget to leave room for the null terminator (the byte set to 0 at the end of the string). An effective way to avoid such problems is to use named constants to declare all strings. A key in this approach is that you use the named constant the same way every time. Declare the string to be length CONSTANT+1, and then use CONSTANT to refer to the length of a string in the rest of the code. Here's an example:





C Example of Good String Declarations










    /* Declare the string to have length of "constant+1".
       Every other place in the program, "constant" rather
       than "constant+1" is used. */
    char name[ NAME_LENGTH + 1 ] = { 0 }; /* string of length NAME_LENGTH */   <-- 1
    ...

    /* Example 1: Set the string to all 'A's using the constant,
       NAME_LENGTH, as the number of 'A's that can be copied.
       Note that NAME_LENGTH rather than NAME_LENGTH + 1 is used. */
    for ( i = 0; i < NAME_LENGTH; i++ )   <-- 2
        name[ i ] = 'A';
    ...

    /* Example 2: Copy another string into the first string using
       the constant as the maximum length that can be copied. */
    strncpy( name, some_other_name, NAME_LENGTH );   <-- 3

(1) The string is declared to be of length NAME_LENGTH+1.
(2) Operations on the string using NAME_LENGTH here…
(3) …and here.





If you don't have a convention to handle this, you'll sometimes declare the string to be of length NAME_LENGTH and have operations on it work with length NAME_LENGTH-1; at other times you'll declare the string to be of length NAME_LENGTH+1 and have operations on it work with length NAME_LENGTH. Every time you use a string, you'll have to remember which way you declared it.



When you use strings the same way every time, you don't have to remember how you dealt with each string individually and you eliminate mistakes caused by forgetting the specifics of an individual string. Having a convention minimizes mental overload and programming errors.



Initialize strings to null to avoid endless strings C determines the end of a string by finding a null terminator, a byte set to 0 at the end of the string. No matter how long you think the string is, C doesn't find the end of the string until it finds a 0 byte. If you forget to put a null at the end of the string, your string operations might not act the way you expect them to.



Cross-Reference





For more details on initializing data, see Section 10.3, "Guidelines for Initializing Variables."




You can avoid endless strings in two ways. First, initialize arrays of characters to 0 when you declare them:





C Example of a Good Declaration of a Character Array




char EventName[ MAX_NAME_LENGTH + 1 ] = { 0 };





Second, when you allocate strings dynamically, initialize them to 0 by using calloc() instead of malloc(). calloc() allocates memory and initializes it to 0. malloc() allocates memory without initializing it, so you take your chances when you use memory allocated by malloc().
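A minimal sketch of the calloc() approach follows. The wrapper name allocate_string() is hypothetical; it applies the CONSTANT+1 convention to a dynamically allocated string.

```c
#include <stdlib.h>

/* Allocates a zero-filled buffer that can hold max_length characters
   plus the null terminator. Returns NULL if the allocation fails.
   The caller releases the buffer with free(). */
char *allocate_string(size_t max_length)
{
    return calloc(max_length + 1, sizeof(char));
}
```

Because every byte starts out as 0, the buffer is a valid empty string as soon as it is allocated, and it remains terminated as long as no more than max_length characters are written to it.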



Use arrays of characters instead of pointers in C If memory isn't a constraint (and often it isn't), declare all your string variables as arrays of characters. This helps to avoid pointer problems, and the compiler will give you more warnings when you do something wrong.



Cross-Reference





For more discussion of arrays, read Section 12.8, "Arrays," later in this chapter.




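Use strncpy() instead of strcpy() to avoid endless strings String routines in C come in safe versions and dangerous versions. The more dangerous routines such as strcpy() and strcmp() keep going until they run into a null terminator. Their safer companions, strncpy() and strncmp(), take a parameter for maximum length so that even if the strings go on forever, your function calls won't. Note that strncpy() does not write a null terminator when the source is at least as long as the maximum length, so declare the destination with length CONSTANT+1 and initialize it to 0, as described earlier in this section.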
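The bounded routines might be wrapped as in the sketch below. NAME_LENGTH, copy_name(), and names_match() are hypothetical names. Because strncpy() leaves the destination unterminated when the source holds NAME_LENGTH or more characters, the sketch writes the terminator explicitly.

```c
#include <string.h>

enum { NAME_LENGTH = 8 };   /* hypothetical maximum name length */

/* Copies at most NAME_LENGTH characters of source into name, which
   must be declared with NAME_LENGTH + 1 bytes. The result is always
   null-terminated, even when source is too long and gets truncated. */
void copy_name(char name[NAME_LENGTH + 1], const char *source)
{
    strncpy(name, source, NAME_LENGTH);
    name[NAME_LENGTH] = '\0';
}

/* Compares at most NAME_LENGTH characters, so the comparison stops
   even if neither string is terminated within that range. */
int names_match(const char *left, const char *right)
{
    return strncmp(left, right, NAME_LENGTH) == 0;
}
```

Copying "Gigamatic Accounting Program" with copy_name() yields the eight-character string "Gigamati", and names_match() treats it as equal to "Gigamatic" because only the first NAME_LENGTH characters are compared.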











