The StreamTokenizer Class

StreamTokenizer breaks up an input stream into tokens and can be used to parse a simple file (excuse me, "input stream"). Read the Java API documentation on StreamTokenizer, and then compare what you read to the following methods.
StreamTokenizer's variables are ttype (one of the constant values TT_EOF, TT_EOL, TT_NUMBER, and TT_WORD, or the character code of an ordinary character); sval (holds the text of the last word token read); and nval (holds the value of the last number token read).

Using StreamTokenizer

Reading the documentation probably isn't enough to get you started with StreamTokenizer, so we're going to work with a simple application that produces a report on the number of classes and functions in a Python source file. Here's the source code:
Follow along with the next interactive session. Afterward we'll look at the code to count the classes and functions. Import the FileInputStream class from java.io, and create an instance of it.
Import the StreamTokenizer class, and create an instance of it. Pass its constructor the FileInputStream instance.
Call nextToken() to get the first token in the file (that is, class).
As you can see, nextToken() returns a numeric value, although you may have been expecting a string value containing "class". In fact, nextToken() returns the type of token, that is, a word, a number, or an EOL or EOF (end-of-file) character, so -3 refers to TT_WORD. The ttype variable holds the type of the last token read.
The sval variable holds the actual last token read. If we want to check if the last token type was a word, we can write this, and, if it was a word, we can print it out.
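These first steps can be sketched in plain Java (the Jython calls are identical apart from syntax). In this sketch, a StringReader stands in for the FileInputStream so the example is self-contained; the class and method names here are made up for the example:

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;

public class FirstToken {
    // Returns the text of the first token if it is a word, or null otherwise.
    // A StringReader stands in for the session's FileInputStream.
    static String firstWord(String source) throws IOException {
        StreamTokenizer st = new StreamTokenizer(new StringReader(source));
        int type = st.nextToken();           // returns the token TYPE, not the text
        if (type == StreamTokenizer.TT_WORD) {
            return st.sval;                  // sval holds the word just read
        }
        return null;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(firstWord("class MyClass:"));  // prints "class"
    }
}
```

Note that nextToken() and ttype agree: after the call, both hold the same type code, and sval carries the word's text only when that code is TT_WORD (-3).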
Call nextToken() again to get the next token, which is MyClass.
Call nextToken() again; this time it should return the ':' token.
Since the token is a ':', StreamTokenizer doesn't recognize it as a word. The only named token types are NUMBER, EOL, EOF, and WORD. For ':' to be recognized as part of a word, it has to be registered with the wordChars() method.
If the type isn't one of these, the number corresponding to the character encoding is returned. Let's see what nextToken() returns for the next character.
The 35 refers to '#', which you can prove with the built-in ord() function.
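Both behaviors just described can be checked directly, again sketched in plain Java: an ordinary character comes back as its own character code, and wordChars() can promote a character such as ':' to a word constituent (the method names here are made up for the example):

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;

public class OrdinaryChars {
    // Returns the type code of the first token in the source text.
    // For an ordinary character, this is its character encoding.
    static int firstType(String source) throws IOException {
        StreamTokenizer st = new StreamTokenizer(new StringReader(source));
        return st.nextToken();
    }

    // Same source, but ':' is first registered as a word character,
    // so it becomes part of the surrounding word instead of its own token.
    static String firstWordWithColon(String source) throws IOException {
        StreamTokenizer st = new StreamTokenizer(new StringReader(source));
        st.wordChars(':', ':');
        st.nextToken();
        return st.sval;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(firstType("#"));                  // prints 35
        System.out.println(firstType(":"));                  // prints 58
        System.out.println(firstWordWithColon("MyClass:"));  // prints MyClass:
    }
}
```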
Get the next token.
The token is a word (-3 equates to TT_WORD). Print sval to find out what the word is.
As you can see, the StreamTokenizer instance is reading text out of the comment from the first line. We want to ignore comments, so we need to put the tokens we took out back into the stream. Push the token back into the stream.
Attempt to push the token before the last one back into the stream.
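What happens next can be seen in a small plain-Java sketch (the token text "alpha"/"beta" is made up for the example): pushBack() remembers only the most recent token, so calling it a second time changes nothing.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;

public class PushBackDemo {
    // Reads two words, calls pushBack() twice, then reads again.
    // The second pushBack() has no effect: only ONE token can be pushed back.
    static String readPushTwiceRead(String source) throws IOException {
        StreamTokenizer st = new StreamTokenizer(new StringReader(source));
        st.nextToken();   // first word
        st.nextToken();   // second word
        st.pushBack();    // the next nextToken() will re-return the second word
        st.pushBack();    // no further effect
        st.nextToken();
        return st.sval;   // still the SECOND word, not the first
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readPushTwiceRead("alpha beta"));  // prints beta
    }
}
```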
Set commentChar() to ignore '#'. (commentChar() takes an integer argument corresponding to the encoding of the character.)
Get the next token, and print it out.
Are you wondering why we still have the comment text? The pushBack() method can only push back the last token, so calling it more than once won't do any good. Let's start from the beginning, creating a new FileInputStream instance and a new StreamTokenizer instance. Create the StreamTokenizer instance by passing its constructor a new instance of FileInputStream.
Iterate through the source code, printing out the words in the file. Quit the while loop when the token type is EOF.
Notice that the comment text isn't in the words printed out.
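A fresh tokenizer with commentChar('#') set before any reading gives exactly this loop. Sketched in plain Java over an in-memory string rather than a file (class and method names made up for the example):

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class WordLoop {
    // Collects every word token, skipping '#' comments through end of line.
    static List<String> words(String source) throws IOException {
        StreamTokenizer st = new StreamTokenizer(new StringReader(source));
        st.commentChar('#');   // treat '#' to end of line as a comment
        List<String> out = new ArrayList<>();
        while (st.nextToken() != StreamTokenizer.TT_EOF) {
            if (st.ttype == StreamTokenizer.TT_WORD) {
                out.add(st.sval);
            }
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        // The comment text never appears among the words.
        System.out.println(words("class MyClass: # a comment\ndef method():"));
    }
}
```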
Parsing Python with StreamTokenizer

Okay, we've done our experimentation. Now it's time for the actual code for counting the classes and functions in our Python source code.
Here's the output:
The main part of the code (where all the action is happening) is
Let's look at it step by step. If the token type isn't equal to EOF, get the next token.
If the token type is WORD,
check to see if the token is a class modifier. If it is, call the parseClass() function, which uses the StreamTokenizer instance to extract the class name and put it on a list.
If the token isn't a class modifier, check to see if it's a function modifier. If so, call parseFunction(), which uses StreamTokenizer to extract the function name and put it on a list.
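The counting logic described above can be sketched in plain Java (the book's version is Jython; the class name and the parseName() helper here are made up for the example):

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class DefCounter {
    final List<String> classes = new ArrayList<>();
    final List<String> functions = new ArrayList<>();

    // Scans the source: after a "class" or "def" keyword, the next word
    // token is taken as the class or function name.
    void scan(String source) throws IOException {
        StreamTokenizer st = new StreamTokenizer(new StringReader(source));
        st.commentChar('#');   // skip Python comments
        while (st.nextToken() != StreamTokenizer.TT_EOF) {
            if (st.ttype == StreamTokenizer.TT_WORD) {
                if (st.sval.equals("class")) {
                    parseName(st, classes);     // like the book's parseClass()
                } else if (st.sval.equals("def")) {
                    parseName(st, functions);   // like the book's parseFunction()
                }
            }
        }
    }

    // Reads the following token; if it is a word, records it as a name.
    private void parseName(StreamTokenizer st, List<String> names) throws IOException {
        if (st.nextToken() == StreamTokenizer.TT_WORD) {
            names.add(st.sval);
        }
    }

    public static void main(String[] args) throws IOException {
        DefCounter c = new DefCounter();
        c.scan("class MyClass:\n    def hello(self): pass\n");
        System.out.println(c.classes + " " + c.functions);
    }
}
```

As in the book's example, functions and methods land in the same list here; separating them is the exercise below.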
StreamTokenizer is a good way to parse an input stream. If you understand its runtime behavior (which you should from the preceding interactive session), you'll be more likely to use it. The more astute among you probably noticed that functions and methods were counted together in the last example. As an exercise, change the code so that each class has an associated list of methods and so that these methods are counted separately. Hint: You'll need to use the resetSyntax() method of StreamTokenizer to set all characters to ordinary. Then you'll need to count the spaces (ord(" ")) and tabs (ord("\t")) that occur before the first word on a line. For this you also need to track whether you hit an EOL token type. (If you can do this exercise, I do believe that you can do any exercise in the book.) As another exercise, create a stream that can parse a file whose contents look like this:
SectionType defines a class of section, and SectionName is like defining a class instance. value equates to a class attribute. Here's an example.
Create a dictionary of dictionaries of dictionaries. The name of the top-level dictionary should correspond to the section type (Communication, Greeting); its value should be a dictionary whose name corresponds to the section names (Client, Host) and whose values correspond to another dictionary. The names and values of the third-level dictionaries should correspond to the name values in the file (sayHello = "Good morning Mr. Bond", type = "RS-232"). If, like baudrate, the name repeats itself, you should create a list corresponding to the name baudrate and, instead of a single value inserted in the bottom-tier dictionaries, put the list as the value. The structure will look like this:
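As a rough sketch of the target structure, here it is built by hand in plain Java with nested maps (the keys come from the prose above; the baudrate values are made up for the example, since the file contents aren't shown here):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class SectionTree {
    // Builds the three-level structure described in the exercise:
    // section type -> section name -> attribute name -> value (or list of values).
    static Map<String, Map<String, Map<String, Object>>> build() {
        Map<String, Map<String, Map<String, Object>>> tree = new HashMap<>();

        Map<String, Object> client = new HashMap<>();
        client.put("sayHello", "Good morning Mr. Bond");

        Map<String, Object> host = new HashMap<>();
        host.put("type", "RS-232");
        // baudrate repeats in the file, so its values collect into a list
        // (these particular numbers are illustrative, not from the book)
        host.put("baudrate", new ArrayList<>(Arrays.asList(9600, 19200)));

        Map<String, Map<String, Object>> greeting = new HashMap<>();
        greeting.put("Client", client);
        tree.put("Greeting", greeting);

        Map<String, Map<String, Object>> communication = new HashMap<>();
        communication.put("Host", host);
        tree.put("Communication", communication);

        return tree;
    }

    public static void main(String[] args) {
        System.out.println(build());
    }
}
```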