Wednesday, October 14, 2009

The StreamTokenizer Class


StreamTokenizer breaks up an input stream into tokens and can be used to parse a simple file (excuse me, "input stream"). Read the Java API documentation on StreamTokenizer, and then compare what you read to the following methods:



  • __init__(Reader)


  • __init__(InputStream)


  • nextToken()
    returns the next token in the stream



  • lineno()
    returns the current line number



  • lowerCaseMode(flag)
    converts all word tokens to lowercase if passed a true value



  • parseNumbers()
    enables the parsing of floating-point numbers



  • pushBack()
    pushes the token back onto the stream so that the next nextToken() call returns it again



  • quoteChar(char)
    specifies the string delimiter character; the whole quoted string is returned as a single token in sval (a short demonstration appears below)



  • resetSyntax()
    resets the syntax table so that all characters are "ordinary" and have no special meaning to the tokenizer



  • commentChar(char)
    specifies a character that begins a comment that lasts until the end of the line; characters in a comment are not returned



  • slashSlashComments(flag)
    allows recognition of // to denote a comment (this is a Java comment)



  • slashStarComments(flag)
    allows recognition of /* */ to denote a comment



  • toString()
    returns a string representation of the current token and the line number it occurs on


  • whitespaceChars(low,hi)
    specifies the range of characters that denote delimiters



  • wordChars(low, hi)
    specifies the range of characters that make up words



  • ordinaryChar(char)
    specifies a character that is never part of a larger token (the character comes back by itself, as its own character-code token)



  • ordinaryChars(low, hi)
    specifies a range of characters that are never part of a larger token (each such character comes back by itself)



  • eolSignificant(flag)
    specifies whether end-of-line (EOL) characters are significant (if not, they're treated like whitespace)




StreamTokenizer's variables are ttype (one of the constant values TT_EOF, TT_EOL, TT_NUMBER, and TT_WORD, or a raw character code); sval (holds the last word or quoted-string token read); and nval (holds the last number read).
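To tie quoteChar() and these variables together, here's a quick illustration. This is a hypothetical snippet, not part of the session that follows; it uses java.io.StringReader to keep it self-contained, and '"' is already a quote character by default, so the quoteChar() call just makes the setup explicit. (True comparisons print as 1 in Jython.)


>>> from java.io import StreamTokenizer, StringReader
>>> t = StreamTokenizer(StringReader('width 42 "a quoted string"'))
>>> t.quoteChar(ord('"'))
>>> t.nextToken() == t.TT_WORD, t.sval
(1, 'width')
>>> t.nextToken() == t.TT_NUMBER, t.nval
(1, 42.0)
>>> t.nextToken() == ord('"'), t.sval
(1, 'a quoted string')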


Using StreamTokenizer


Reading the documentation probably isn't enough to get you started with StreamTokenizer, so we're going to work with a simple application that produces a report on the number of classes and functions in a Python source file. Here's the source code:



class MyClass: #This is my class
    def method1(self):
        pass
    def method2(self):
        pass

#Comment should be ignored
def AFunction():
    pass

class SecondClass:
    def m1(self):
        print "Hi Mom" #Say hi to mom
    def m2(self):
        print "Hi Son" #Say hi to Son

#Comment should be ignored
def BFunction():
    pass

Follow along with the next interactive session. Afterward we'll look at the code to count the classes and functions.


Import the FileInputStream class from java.io, and create an instance of it.



>>> from java.io import FileInputStream
>>> file = FileInputStream("C:\\dat\\ParseMe.py")

Import the StreamTokenizer class, and create an instance of it. Pass its constructor the FileInputStream instance.



>>> from java.io import StreamTokenizer
>>> token = StreamTokenizer(file)

Call nextToken() to get the first token in the file (that is, class).



>>> token.nextToken()
-3

As you can see, nextToken() returns a numeric value, although you may have been expecting a string value containing "class". In fact, nextToken() returns the type of token, that is, a word, a number, or an EOL or EOF (end-of-file) character, so -3 refers to TT_WORD.


The ttype variable holds the last type of token read.



>>> token.ttype
-3

The sval variable holds the text of the last word token read. If we want to check whether the last token type was a word, we can write this, and, if it was a word, we can print it out.



>>> if token.ttype == token.TT_WORD:
...     print token.sval
...
class
>>>

Call nextToken() again to get the next token, which is MyClass.



>>> token.nextToken()
-3

>>> print token.sval
MyClass

Call nextToken() again; this time it should return the ':' token.



>>> token.nextToken()
58

>>> print token.sval
None

Since the token is a ':', which isn't part of a word, sval stays None and nextToken() simply returns the character's code. The only special token types are NUMBER, EOL, EOF, and WORD; for ':' to be treated as part of a word, it has to be registered with the wordChars() method.



>>> token.TT_NUMBER
-2

>>> token.TT_EOL
10

>>> token.TT_EOF
-1

>>> token.TT_WORD
-3

If the type isn't one of these, the number corresponding to the character encoding is returned. Let's see what nextToken() returns for the next character.



>>> token.nextToken()
35

The 35 refers to '#', which you can prove with the built-in ord() function.



>>> ord('#')
35
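Incidentally, if you wanted ':' to come back as part of a word instead of as a raw character code, you could register it with wordChars(), as mentioned earlier. Here's a minimal illustration (a hypothetical snippet, separate from this session, using java.io.StringReader to keep it self-contained):


>>> from java.io import StreamTokenizer, StringReader
>>> t = StreamTokenizer(StringReader("class MyClass:"))
>>> t.wordChars(ord(':'), ord(':'))
>>> t.nextToken(); t.sval
-3
'class'
>>> t.nextToken(); t.sval
-3
'MyClass:'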

Get the next token.



>>> token.nextToken()
-3

The token is a word (-3 equates to TT_WORD). Print sval to find out what the word is.



>>> print token.sval
This

As you can see, the StreamTokenizer instance is reading text out of the comment on the first line. We want to ignore comments, so we need to push the tokens we just read back into the stream.


Push the token back into the stream.



>>> token.pushBack()

Attempt to push the token before the last one back into the stream.



>>> token.pushBack()

Set commentChar() to ignore '#'. (commentChar() takes an integer argument corresponding to the encoding of the character.)



>>> token.commentChar(ord('#'))

Get the next token, and print it out.



>>> token.nextToken()
-3

>>> print token.sval
This

Are you wondering why we still have the comment text? The pushBack() method can only push back the last token, so calling it more than once won't do any good.
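Used just once, though, pushBack() does exactly what you'd expect: the next nextToken() call hands you the same token again. Here's a quick hypothetical mini-session (separate from the one above, using java.io.StringReader to keep it self-contained):


>>> from java.io import StreamTokenizer, StringReader
>>> t = StreamTokenizer(StringReader("alpha beta"))
>>> t.nextToken(); t.sval
-3
'alpha'
>>> t.pushBack()
>>> t.nextToken(); t.sval
-3
'alpha'

Back to our file: let's start from the beginning, creating a new FileInputStream instance and a new StreamTokenizer instance.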


Create the StreamTokenizer instance by passing its constructor a new instance of FileInputStream, and set the comment character again (a brand-new tokenizer starts over with the default syntax table).



>>> file = FileInputStream("c:\\dat\\parseMe.py")
>>> token = StreamTokenizer(file)
>>> token.commentChar(ord('#'))

Iterate through the source code, printing out the words in the file. Quit the while loop when the token type is EOF.



>>> while token.ttype != token.TT_EOF:
...     token.nextToken()
...     if (token.ttype == token.TT_WORD):
...         print token.sval

Notice that the comment text isn't in the words printed out.



class
MyClass
def
method1
self
pass
def
method2
self
pass
def
AFunction
pass
...
...


Parsing Python with StreamTokenizer


Okay, we've done our experimentation. Now it's time for the actual code for counting the classes and functions in our Python source code.



from java.io import FileInputStream, StreamTokenizer

# Create a stream tokenizer by passing a new
# instance of the FileInputStream
token = StreamTokenizer(FileInputStream("c:\\dat\\parseMe.py"))

# Set the comment character.
token.commentChar(ord('#'))

classList = []
functionList = []

# Add the next word token in the stream to a list
def addToList(theList, token):
    token.nextToken()
    if (token.ttype == token.TT_WORD):
        theList.append(token.sval)

# Adds a class to the class list
def parseClass(token):
    global classList
    addToList(classList, token)

# Adds a function to the function list
def parseFunction(token):
    global functionList
    addToList(functionList, token)

# Iterate through the stream until the
# token is of type TT_EOF, end of file
while token.ttype != token.TT_EOF:
    token.nextToken()
    if (token.ttype == token.TT_WORD):
        if (token.sval == "class"):
            parseClass(token)
        elif (token.sval == "def"):
            parseFunction(token)

# Print out details about a function or class list
def printList(theList, type):
    print "There are " + `len(theList)` + " " + type
    print theList

# Print the results.
printList(classList, "classes")
printList(functionList, "functions and methods")

Here's the output:



There are 2 classes
['MyClass', 'SecondClass']
There are 6 functions and methods
['method1', 'method2', 'AFunction', 'm1', 'm2', 'BFunction']

The main part of the code (where all the action is happening) is



# Iterate through the stream until the
# token is of type TT_EOF, end of file
while token.ttype != token.TT_EOF:
    token.nextToken()
    if (token.ttype == token.TT_WORD):
        if (token.sval == "class"):
            parseClass(token)
        elif (token.sval == "def"):
            parseFunction(token)

Let's look at it step by step.


If the token type isn't equal to EOF, get the next token.



while token.ttype != token.TT_EOF:
    token.nextToken()

If the token type is WORD,



if(token.ttype == token.TT_WORD):

check to see if the token is the class keyword. If it is, call the parseClass() function, which uses the StreamTokenizer instance to extract the class name and put it on a list.



if (token.sval == "class"):
    parseClass(token)

If the token isn't the class keyword, check to see if it's the def keyword. If so, call parseFunction(), which uses StreamTokenizer to extract the function name and put it on a list.



elif (token.sval == "def"):
    parseFunction(token)

StreamTokenizer is a good way to parse an input stream. If you understand its runtime behavior (which you should after the preceding interactive session), you'll be able to use it effectively.


The more astute among you probably noticed that functions and methods were counted together in the last example. As an exercise, change the code so that each class has an associated list of methods and so that these methods are counted separately.


Hint: You'll need to use the resetSyntax() method of StreamTokenizer to set all characters to ordinary. Then you'll need to count the spaces (ord(' ')) and tabs (ord('\t')) that occur before the first word on a line. For this you'll also need to track whether you've hit an EOL token type; a rough starting point is sketched below. (If you can do this exercise, I do believe that you can do any exercise in the book.)
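Here's one possible starting point, just to show the moving parts. This is a sketch under assumptions, not a full solution: it uses eolSignificant() and ordinaryChar() instead of resetSyntax(), the printed labels are placeholders, and the file path is the one used throughout the chapter.


from java.io import FileInputStream, StreamTokenizer

token = StreamTokenizer(FileInputStream("c:\\dat\\parseMe.py"))
token.commentChar(ord('#'))
token.eolSignificant(1)         # report TT_EOL so we know where lines begin
token.ordinaryChar(ord(' '))    # spaces now come back as tokens...
token.ordinaryChar(ord('\t'))   # ...and so do tabs, so we can count them

atLineStart = 1
indent = 0
while token.nextToken() != token.TT_EOF:
    if token.ttype == token.TT_EOL:
        atLineStart = 1
        indent = 0
    elif token.ttype == ord(' ') or token.ttype == ord('\t'):
        if atLineStart:
            indent = indent + 1   # leading whitespace before the first word
    else:
        if token.ttype == token.TT_WORD and token.sval == "def":
            if indent > 0:
                print "method (indent " + `indent` + ")"
            else:
                print "module-level function"
        atLineStart = 0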


As another exercise, create a stream that can parse a file whose contents look like this:



[SectionType:SectionName]
value1=1
value2 = 3 #This is a comment that should be ignored
value4 = "Hello"

SectionType defines a class of section, and SectionName is like defining a class instance. Each name=value line equates to a class attribute.


Here's an example.



[Communication:Host]
type = "TCP/IP" #Possible values are TCP/IP or RS-232
port = 978 #Sets the port of the TCP/IP

[Communication:Client]
type = "RS-232"
baudrate = 9600
baudrate = 2800
baudrate = 19200

[Greeting:Client]
sayHello = "Good morning Mr. Bond"

[Greeting:Host]
sayHello = "Good morning sir"

Create a dictionary of dictionaries of dictionaries. The keys of the top-level dictionary should correspond to the section types (Communication, Greeting); each value should be a dictionary whose keys correspond to the section names (Client, Host) and whose values are yet another dictionary. The keys and values of these third-level dictionaries should correspond to the name/value pairs in the file (sayHello = "Good morning Mr. Bond", type = "RS-232"). If, like baudrate, a name repeats itself, you should create a list corresponding to the name baudrate and, instead of a single value, put the list in the bottom-tier dictionary as the value.


The structure will look like this:



{'Communication': {'Client': {'type': 'RS-232',
                              'baudrate': [9600, 2800, 19200]},
                   'Host':   {'type': 'TCP/IP',
                              'port': 978}},
 'Greeting':      {'Client': {'sayHello': 'Good morning Mr. Bond'},
                   'Host':   {'sayHello': 'Good morning sir'}}}
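And here's a sketch of one way to start on this one. The file name is an assumption, the ':', '=', '[', and ']' tokens are skipped blindly for brevity, and since parseNumbers() is on by default, numeric values come back as floats in nval.


from java.io import FileInputStream, StreamTokenizer

token = StreamTokenizer(FileInputStream("c:\\dat\\sections.txt"))
token.commentChar(ord('#'))     # ignore #-comments
token.quoteChar(ord('"'))       # a quoted value comes back whole in sval

sections = {}
current = None                  # dictionary for the section being parsed

while token.nextToken() != token.TT_EOF:
    if token.ttype == ord('['):               # [SectionType:SectionName]
        token.nextToken()
        sectionType = token.sval
        token.nextToken()                     # skip the ':'
        token.nextToken()
        sectionName = token.sval
        token.nextToken()                     # skip the ']'
        if not sections.has_key(sectionType):
            sections[sectionType] = {}
        sections[sectionType][sectionName] = {}
        current = sections[sectionType][sectionName]
    elif token.ttype == token.TT_WORD and current is not None:
        name = token.sval
        token.nextToken()                     # skip the '='
        token.nextToken()                     # the value itself
        if token.ttype == token.TT_NUMBER:
            value = token.nval                # a float, e.g. 978.0
        else:
            value = token.sval                # a word or a quoted string
        if current.has_key(name):             # a repeated name becomes a list
            if type(current[name]) != type([]):
                current[name] = [current[name]]
            current[name].append(value)
        else:
            current[name] = value

print sections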







