E m acs For A ider: A n E xtern al Learning S y stem Jerem y (Ya-Yu) H sieh B .S c., University Of Northern British Columbia, 2003 Thesis Subm itted In Partial Fulfillment Of The Requirements For The Degree Of M aster of Science in Mathematical, Computer, And Physical Sciences (Computer Science) The University Of Northern British Columbia May 2006 © Jeremy (Ya-Yu) Hsieh, 2006 1^1 Library and Archives Canada Bibliothèque et Archives Canada Published Heritage Branch Direction du Patrimoine de l'édition 395 Wellington Street Ottawa ON K1A 0N4 Canada 395, rue Wellington Ottawa ON K1A 0N4 Canada Your file Votre référence ISBN: 978-0-494-28361-5 Our file Notre référence ISBN: 978-0-494-28361-5 NOTICE: The author has granted a non­ exclusive license allowing Library and Archives Canada to reproduce, publish, archive, preserve, conserve, communicate to the public by telecommunication or on the Internet, loan, distribute and sell theses worldwide, for commercial or non­ commercial purposes, in microform, paper, electronic and/or any other formats. AVIS: L'auteur a accordé une licence non exclusive permettant à la Bibliothèque et Archives Canada de reproduire, publier, archiver, sauvegarder, conserver, transmettre au public par télécommunication ou par l'Internet, prêter, distribuer et vendre des thèses partout dans le monde, à des fins commerciales ou autres, sur support microforme, papier, électronique et/ou autres formats. The author retains copyright ownership and moral rights in this thesis. Neither the thesis nor substantial extracts from it may be printed or otherwise reproduced without the author's permission. L'auteur conserve la propriété du droit d'auteur et des droits moraux qui protège cette thèse. Ni la thèse ni des extraits substantiels de celle-ci ne doivent être imprimés ou autrement reproduits sans son autorisation. In compliance with the Canadian Privacy Act some supporting forms may have been removed from this thesis. Conformément à la loi canadienne sur la protection de la vie privée, quelques formulaires secondaires ont été enlevés de cette thèse. While these forms may be included in the document page count, their removal does not represent any loss of content from the thesis. Bien que ces formulaires aient inclus dans la pagination, il n'y aura aucun contenu manquant. Canada A b stract In this thesis, I observed and established communications between EMACS, the external lexical analyzers, and the parser. There are two versions of the lexical analyzers: internal and external. The internal lexical analyzer was implemented in the programming language of E macs -L IS P . The core of the external lexical analyzer was implemented in Flex. For the parser, Bison and Flex were used. This thesis describes (1) details of each component; (2) comparisons of the difference between internal and external information processing; (3) experimental advantages and disadvantages of each version; (4) evaluations of the efficiencies based on running time, memory usage, features presentations, and other resources usage; (5) what needs to be done to achieve such communications; and (6) A ldor syntaxes and grammar. 11 Table of Contents A b stract il Table o f C ontents iii List o f Tables v List o f F igures vi In trod u ction 1.1 W hat is A ld o r ? .......................................................................................... 1.1.1 Lexical Features of ALDOR .............................................................. 1.2 W hat is E m a c s ? .......................................................................................... 1.3 Literature R ev iew ........................................................................................... 1.4 Purpose of my W o r k .................................................................................... 1.5 Emacs C o n cep ts.............................................................................................. 1.6 Version D e t a i l s .............................................................................................. 1 2 4 9 10 11 12 17 T he Internal Lexical A nalyzer 18 2.1 Overview of Building the Internal Lexical A n a ly z e r .............................. 19 2.2 Method ...............................................................................................................20 2.3 Efficiency ........................................................................................................... 26 2.4 C o rre c tn e s s ........................................................................................................ 27 2.5 Problems Encountered and their S o lu tio n s ...................................................27 2.6 Conclusion and Summary .............................................................................. 29 T he E xtern al L exical A nalyzer 31 3.1 Overview of Building the External Lexical A n a ly z e r.................................31 3.2 Method .............................................................................................................. 34 3.3 Communication or P i p e s ................................................................................. 40 3.4 Efficiency ........................................................................................................... 41 3.5 C o rre c tn e s s ........................................................................................................ 42 3.6 Problems Encountered and their S o lu tio n s .................................................. 44 3.7 Conclusion and Summary .............................................................................. 52 111 4 T he 4.1 4.2 4.3 4.4 4.5 4.6 4.7 4.8 A ldor Parser 53 Overview of Building the P a r s e r .................................................................... 54 Method .............................................................................................................. 55 C om m u n icatio n ................................................................................................. 64 Efficiency ........................................................................................................... 66 C o rre c tn e s s ........................................................................................................68 Problems Encountered and their S o lu tio n s ................................................ 68 Future W o rk ........................................................................................................73 Conclusion and Summary ..............................................................................74 5 A id er M od e 76 5.1 Progress and M e t h o d s .................................................................................... 77 5.2 Syntax-based Colouring ................................................................................. 78 5.3 U pdating Algorithms for the ALDOR M o d e ................................................ 79 5.4 F e a tu r e s .............................................................................................................. 81 5.5 Efficiency ...........................................................................................................82 5.6 Problems Encountered and their S o lu tio n s ................................................ 83 5.7 Conclusion and Summary ..............................................................................83 6 C onclusions and Sum m ary 84 6.1 Summary of my Research T im e lin e ............................................................. 85 6.2 W hat can People Learn from my Work? ....................................................86 6.3 Comparison of Lexical Analyzers .................................................................86 6.4 Further Improvement and Future W o r k .......................................................87 B ibliography 88 A A L exical Stru cture for A ldor 92 A .l Lexical S t r u c t u r e ..............................................................................................92 A. 1.1 C h a ra c te rs...............................................................................................92 A. 1.2 The Escape C h a r a c t e r ........................................................................ 93 A. 1.3 T o k e n s ..................................................................................................... 93 A.2 Differences Between the Implemented Lexical Scanner and the A l ­ d o r Lexical S t r u c t u r e ..................................................................................... 96 B C on text Free G ram m ars for A ldor 98 C A b stract S yn tax for A ldor 112 D C odes for m y W ork 117 E U M L D iagram o f E xtern al L exical A nalyzer 118 F U M L D iagram o f Parser 120 IV List of Tables 1.1 1.2 1.3 1.4 Token Categories for ALDOR Tokens ...................................................... E m acs L IS P and F l ex Regular Expressions for ALDOR Tokens. . . E m acs-L IS P and F l ex Regular Expressions for A l d o r Floating Point Tokens...................................................................................................... The Version Details of the Programs U s e d ............................................ 5.1 The Colour Association Used by the A l d o r Mode 5 6 7 16 ................................ 78 List of Figures 2.1 2.2 2.3 2.5 2.6 A Regular Expression to Match A l d o r Reserved W o r d s .............22 A Regular Expression to Match A l d o r String Tokens ...........................23 A Regular Expression to Match A l d o r Intermediate T o k e n s ...... 24 A Regular Expression to Match A l d o r Floating Point Literals Tokens 25 A Regular Expression to Match A l d o r Pre-document Tokens . . . . 25 A l d o r Token P r e c e d e n c e .............................................................................. 28 3.1 3.2 3.3 3.4 3.5 3.6 Definitions for Some A l d o r Tokenizer C o m p o n en ts.................................37 A ctio n s to Take W hen ALDOR Finds a T o k e n .......................................... 38 Timings of the Internal and External Lexical A n a ly z e rs.......................... 43 The a ld o r -p r o c e s s -d o n e -s e n tin e l F u n c tio n .......................................... 47 The a ld o r - r e s t o r e - p o in t - f l i t e r Function .......................................... 49 The a ld o r - g e t - 1 - v a lu e Function ..............................................................51 4.1 4.2 4.3 4.4 4.5 4.6 4.7 4.8 4.9 4.10 Function th a t Prints the Start and End Point .......................................... 58 Example of Token H a n d lin g ...........................................................................59 Example of B ison Codes for the Update “On The Fly” Algorithm . . 61 Example of Tree Structure in B i s o n ..............................................................63 Timing D ata for Tree Structure P a r s e r ....................................................... 67 Rules to Resolve Ambiguous G ram m ars....................................................... 69 A Tree Structure with a NULL P r o b le m ....................................................... 70 Solution for the Tree Structure with NULL Problem .................................71 A Tree Structure with a Sibling P r o b le m .................................................... 72 Solution for the Tree Structure with Sibling P r o b l e m ............................. 72 E .l UML Diagram for the External Lexical A nalyzer..................................... 119 F .l UML Diagram for the P a rs e r......................................................................... 121 2.4 VI Chapter 1 Introduction There are many different programming languages. As the computer world expands, programmers require more powerful and up-to-date programming languages to pro­ duce effective executable programs. For this reason, some programming languages have been modified and upgraded to keep pace with the current computer program­ ming world (for example, C to and its transformation into C # ) . As a result, some new programming languages are created and some of the old and out-of-date programming languages become less popular. Since there are so many different programming languages, good programming text editors will make an im portant difference in the coding environments for programmers. Good text editors will not only help programmers program faster, and more efficiently, but also improve the readability and quality of program code. I have designed a protocol which helps programmers to expand the language modes in the EMACS text editor. Currently, the language modes which are supported by E m acs are written in E m a c s -L IS P and interpreted by the EMACS text editor itself. I also developed a method which enable the EMACS text editor to interact with external sources. In this thesis the sources are an external lexical analyzer and an external parser. Lexical analysis is the name given to the processing of an input sequence of characters (such as the source code of a computer program) to produce, as output, a sequence of symbols called “lexical tokens” , or just “tokens” . A lexical analyzer makes it possible for EMACS to do syntax-based colouring. Additionally, it provides information which is required by a parser. The parser starts the process of analyzing an input sequence (read from a file or a keyboard, for example) in order to determine its grammatical structure with respect to a given formal grammar. It transforms input text into a data structure, usually a tree, which is suitable for later processing and which captures the implied hierarchy of the input. Generally, parsers operate in two stages, first identifying the meaningful tokens in the input, and then building a parse tree from those tokens. The program m ing language w ith which I chose to dem onstrate th e algorithm s is called A l d o r . T he result of my dem onstration is an ALDOR mode for EMACS. In this chapter I first describe the programming language ALDOR, and then describe what text editors are. My reason for choosing EMACS as my research editor instead of some other text editor are discussed. Following th a t is a survey of previous work. Finally, I finish up this chapter with the explanation of basic components which I need in my work. 1.1 W h at is Aldor? A ldor was derived from another program language, called AXIOM. This language is very young and new; therefore, there is not th a t much related information which one can find from the Internet or in libraries. My main resources for the program­ ming language ALDOR are from A l d o r ’s official site (h ttp ://w w w .a ld o r .o r g /), my supervisor and discussions with some ALDOR development people through email. Additionally, the following references also gave me a better understanding of the programming language ALDOR: “A l d o r User Guide” [27] and “libaldor User Guide and Reference Manual” [3j. Some of the presentation papers and technical reports [2, 26, 24, 22, 23, 7, 25] are useful for programmers to understand the pro­ gramming language A l d o r and implement some simple programs. If a programmer wishes to learn the details of A l d o r , [29] discussed categories in A l d o r and [18] discussed the type system. The programming language ALDOR is a kind of mathematical programming language which was designed to do tasks similar to M a p l e but not like those of F o r t r a n . It is an imperative programming language with first-class types and which supports functional programming. ALDOR also has following characteristics: it has a two-level object model with inheritance (c.f. H a s k e l l ); it allows overloading of symbol names; types and functions are (constant) values; and it has generators, post facto extensions, and other non-standard language features. It has foreign language interfaces for: L IS P , C, G^^, and F o r t r a n 77. Moreover, it provides an automatic garbage collection feature; not to mention the ability to compile to other languages like FO A M , L IS P , or C. The primary considerations for the programming language A l d o r are generality, composability, efficiency, and interoperability. A l d o r looks like both functional and modern programming languages. The structure of the language looks like G^+, J a v a , C, B a s ic , among other modern programming languages — with a semi­ colon at the end of statements, matched parentheses, and control loops. On the other hand, the features it provides and the structure of the language itself are similar to some of functional programming languages, such as SM L, F o r t r a n , and M a p l e . T h e s e fe a tu re s m ak e A l d o r a v e ry good te s tin g la n g u a g e to m ak e u se of. Finally, ALDOR can inter-operate with many other languages, like C, G ^^, J a v a , and others. The program m ing language ALDOR has an LA LR(l) gram m ar. An LALR (Look- Ahead Left to right and produce a Rightmost derivation) parser is a specialized form of LR parser th at can deal with more context-free grammars than simple LR parsers but fewer than LR(1) parsers can [28]. It is a very popular type of parser because it gives a good trade-off between the number of grammars it can deal with and the size of the parsing tables it requires. There are some principles of the programming language ALDOR discussed in [15]. One of these is: “All objects are first-class” . For this reason, programmers can use variables whenever and wherever they wish. It is also legal to write functions inside other functions (nested functions). Another principle is: “Types, both domains and categories, are treated in the same way as any other objects” [15]. In other words, in A l d o r , types get treated just like any other variables or constants. Hence, A l d o r programmers can place type strings on the left hand side of assignment operators and assign some values to them — this is a completely legal operation in A l d o r . This principle increases the difficulty involved in achieving syntax-based colouring for the editor mode. If type strings can be variables, then whether they are syntactically coloured as a type or as a variable ought to depend on a parse and not just on a lexical class. As a result, it is essential to have an A l d o r parser to provide additional information. 1.1.1 L exical F eatures o f A ldor There are some ALDOR tokens which I explain here for clarity in later chapters. In this section, Table 1.1 describes th e token types and Table 1.2 defines A l d o r tokens and their regular expression forms. The character has been treated specially in ALDOR programming language. In other programming languages, the character is treated as part of identifiers; nevertheless, ALDOR treats the character as an escape character. An escape char­ acter followed by one or more white space characters causes the white space to be Token C ategories Category Example D escription Reserved Keywords Definable Keywords Class add case M achin eln teg er Import functions in c lu d e Pre-Document +++ I am Pre-Doc Post-Document ++ I am Post-Doc Comment — I am Comment String "a s tr in g " Identifier id_count Float Integer Definable Operator 1.234e56 3 # The reserved words for ALDOR. Keywords that the user can define. Identifiers th at are likely to identify categories. Identifiers th at are like to identify the names of import functions. Start with three plus signs. Used to document the code, appear be­ fore the code they describe. Can be extracted by autom atic documenta­ tion tools. Start with two plus signs. Used to document the code. Appear af­ ter the code it described. Can be extracted by autom atic documenta­ tion tools. Start with two minus signs. Used to comment the code. Ignored by autom atic documentation tools. String literals. Start and end with double quotes. Represent character data. “Variable” tokens. Must not be re­ served words or operators. Floating point literals. Integer literals. Operators th at can be re-defined by programmers. Reserved operators th a t can not be re-defined. Operators th at may be included in A l d o r in the future. Reserved Operator Future Operator [| Table 1.1: Token Categories for A l d o r Tokens C lassification o f ALDOR tokens L IS P rx S-expression F l e x syntax Reserved Keywords, Definable Keywords, Class, Import functions Hard coded fo r both Pre-Document "+++". * (and (group (repeat 3 "+")) (1+ n o t-n e w lin e )) Post-Document (and (group (repeat 2 "+")) (1+ not-new line)) Comment It— It _ * (and (group (repeat 2 "-")) (1+ not-new line)) String (and (0+ (or (and (1+ space)) (and (not space)) (not (any "_\n\"" )))) (and word-start (or (and "0" word-end) (and "1" word-end) (and (any "a-zA-Z") (* (or (any "a-zA-Z0-9!?l" ) ) ) ) ) ) Iden tifier "0" 1"1" 1 {alpha} ({alpha} 1 { d i g i t } 1 [ ! ? |] )* Float See Table 1.3 Definable Operator, Reserved Operator, Future O perator Hard coded fo r both Table 1.2: EMACS L IS P and F l ex Regular Expressions for A l d o r Tokens. A ldor FLOAT tokens (or (and (* d i g i t ) "." (+ , e s c - d i g i t ) (zero-or-one ,expon)) (and d i g i t (* , e s c - d i g i t ) (* , e s c - d i g i t ) (zero-or-one ,expon)) (and d i g i t (* , e s c - d i g i t ) ,expon) (and , radix (or (and (* ,e s c - lo n g - d ig i t ) It tl (+ ,e s c - lo n g - d ig i t ) (zero-or-more ,expon)) (and (+ ,e s c - l o n g - d i g i t ) Float { d i g i t } * " ." { e s c _ d ig it} + {expon}? 1 { d ig it } + " ." { e s c _ d ig it} * {expon}? 1 { d ig it } + {expon} 1 {radix} {esc _ lo n g _ d ig it}* II {esc_ lo n g _ d ig it}+ {expon}? 1 {radix} {esc_ lo n g_ d igit}+ tl (* , e s c - lo n g - d ig i t ) (zero-or-more , expon)) (and (+ ,e s c - lo n g - d ig i t ) , expon)) ) ) It II {esc _ lo n g _ d ig it}* {expon}? 1 {radix} {esc_lo n g _d ig it}+ {expon} Table 1.3: E m a cs -L IS P and F l e x Regular Expressions for ALDOR Floating Point Tokens. ignored. [27, pp.2Al], It also has the following effects on other tokens. It con­ verts keywords into non-reserved identifiers; it allows visual grouping in integer and floating-point literals; it also allows arbitrary characters to be included in strings and identifiers. Because ALDOR is a programming language for computer algebra it has some unusual tokenization rules for numbers. To begin with, “0” and “1” are identifiers rather than integer literals. This allows domains to define 0 and 1 without being forced to define all integer literals. In ALDOR it is also possible for domains to give new meanings to integer and floating-point literals. These literals may contain a radix (between 2 and 36), which allows a programmer to use different number bases without declaring them specially or doing conversions. However, these powerful features increase the difficulty of recognition of ALDOR tokens. In A ld o r , it is valid for an identifier to contains the symbols “ !” , “?” , or “ I” ; however, none of these symbols is allowed to be the first character of an identifier. The final set of tokens which I want to introduce is pre-document, post-document, comment, and string tokens. ALDOR pre-document tokens are defined as any symbol sequence following three plus (+++) signs. A ldor post-document tokens are defined as any symbol sequence following by two plus (++) signs. A ldor comments are defined as symbol sequences which start with two minus (—) signs. Similar to the way Javadoc handles comments in Java programs, ALDOR pre-document and post­ document tokens can be extracted by some ALDOR tools to create documentation. Finally, a string token is defined as any character sequence which starts and ends with a double quote (") character, where the sequence between the start and end q u o te s c o n ta in s n o u n -e s c a p e d d o u b le q u o te c h a ra c te rs o r new -lines. M ore d e ta ils about A ldor tokens are given in Appendix A. 1.2 W hat is Em acs? “Em acs is the extensible, customizable, self-documenting real-time display editor” — E m acs User Guide ([12]) It is a "real-time" editor because display is updated very frequently, usually after each character or pair of characters typed by a user. (Most text editors do this.) “Self-documenting” means th a t at any time, users can type a special command, to find out what their options are. There are complete built-in information files for the editor and the programming language, as well as keyword accessible help for commands, functions, and options. Furthermore, the documentation is customizable in th at users can add to, delete, or modify the documentation at any point of time. “Customizable” means th at users can change the definitions of EMACS com­ mands. Therefore it can satisfy different kinds of users’ personal habits and pref­ erences. For this reason, users can turn E m acs into a personal text editor which suits them best. This is one good feature which most text editors do not provide. “Extensible” means th a t users or program m ers can go beyond simple custom iza­ tion and w rite entirely new com m ands or program s in EMACS. EMACS is an open source te x t editor, which allows program m ers to modify the core, create extensions for it, and fix bugs. E m acs was chosen for this project because it makes more sense to extend an existing editor, rather than to write a new one. EMACS is not only a text editor which provides many powerful features and is open source, but it also has a large community which uses it. For these reasons, E m acs was the text editor I chose to work on. More details about text editor EMACS can be found in: [4, 20, 21]. 1.3 Literature R eview Before I started to work on this thesis, I first did some research on text editors, lexical analyzers, parsers, communication algorithms and the materials which relate to this thesis. The main idea of my work is for a computer program, particularly a text editor, to communicate with an external source; for this reason, the main focus is on the communication between computers, the programs interaction, and the techniques of information processing. Although I did not find as many references as I expected to help me on this thesis, there are still some remarkable references which I would like to introduce. By reading these references, I was able to settle on the research direction. Furthermore, I gained a great deal of knowledge which relates to this thesis. “Programming on an already full brain” [11] introduced an editor tool called Emacs Menus which helps programmers to develop a program. This paper explains the tools which Emacs Menus provides, and explains their concepts in details. “Practical applications of a syntax directed program manipulation environment” [6] is another reference which I found very helpful. This paper brings in information about a syntax directed editor and abstract representation of data. After reading this paper, I learned about syntax directed editors and some possible features which can be implemented with extra information generated by a parser. The paper “UNIX Emacs: a retrospective (lessons for flexible system design)” [1] helped me to gain a better understanding of EMACS and allowed me to comprehend the concept of the E m acs core. After reading this paper, I was clear on what can be added to the text editor E m ACS. Finally, after I finished this thesis, I found a very new article called “An EMACS mode for A l d o r ” [13] by Ralf Hemmecke. This paper introduced an ALDOR mode for E m a c s . Hemmecke’s A l d o r mode provides a token identification feature which 10 is similar to both of my internal and external lexical analyzers. However, my A ldor mode provides a lot more information to users from my lexical analyzers, the text properties, and a parse tree. In Hemmecke’s A ldo r mode, the parser is not present, and no parse tree is generated. The indentation section in his paper caught my attention. It used the same indentation logic as I implemented here; which is to find the open and close brackets, and then calculate the proper indentation levels. Nevertheless, the A ldor parser in this thesis generates a parse tree. W ith all information provided by my parser, users will be able to implement a more powerful and useful indentation function. In conclusion, even though I did not find a lot of information related to text editor learning; I still gained a great amount of knowledge on other components which are required for me to complete a text editor mode. Although there are some other programming modes for EMACS, none of them were implemented externally and running in parallel with EMACS. Therefore, I concluded th a t the research field is relatively new. On the other hand, there are already a lot or studies done on the components which are necessary to make the whole experiment work. 1.4 P u rp ose o f m y W ork In the current computer world there are many different kinds of text editors for users to write or modify their text files. I have been a programmer for a long time but I rarely find any text editor th at really knows programming languages themselves. For example, all programming text editors allow users to modify program files; most of them do indentation and syntax-based colouring for programming languages; some programming text editors allows users to build or debug programs; and a few programming text editors provide tools for a user to add functions. For example, some programming text editors allow a C'*”''programmer to click a menu to execute 11 a simple command th a t adds a w hile-loop tem plate with proper indentation. Some match parentheses and provide tem plate filling, which allows users to fill in all the required information (such as term ination conditions and the actions this loop presents). However, according to my knowledge, almost none of the “free” programming text editors teach programmers how to program, or assist programmers to program a project. I believe th at most programmers have encountered the situation where they know what they want to do, but were unable to recall the functions they wish to use; or what its syntax is; or whether it even exists. Therefore, a programming text editor th a t helps programmers to program would be very useful. It is not very complicated to provide these features. One can achieve the learning and assistance features by adding a data base which includes all the required information, and then apply some database retrieval algorithms to complete the search operations, and finally return results to programmers. However, no m atter how complete a data base is; some information may be absent from it. Moreover, for a programming text editor to support a new programming language, it will require a “plug in” with a complete data base. It is not a very efficient way to supply languages support features. For these reasons, it would be better if a computer is able to interact with external sources (i.e. those external to the text editor itself) efficiently. 1.5 Em acs C oncepts In this section I explain some basic com ponents of EMACS and some of th e functions I used in this thesis. different. The program m ing languages L IS P and EMACS-LISP are The skeleton of th e tex t editor E m acs is m ainly program m ed in the program m ing language E m a cs -L IS P ; therefore, th e program m ing language which 12 I learned to im plem ent EMACS com ponents is E m a cs -L IS P . T here are a few term s related to EMACS-langLISP th a t I want to explain first to help readers to understand this thesis. In E m a c s , each file opens in a bujfer ([10, pp. 501-516], [20, Chapter 15]). How­ ever, not all buffers are associating with files. A buffer may contains something other than a file. For instance, a shell program can be displayed in a buffer. Additionally, the text displayed in a buffer is not the actual text file on the hard drive; instead, it is a duplicate version. EMACS places buffers in windows, and each window opens in a frame. EMACS gives users the ability to switch between buffers inside a window; furthermore, a buffer may be displayed in more than one window [10, pp. 517-550]. The component which contains one or more windows is called a frame. Users can open many windows in a frame. A frame in the EMACS text editor is the same as a window in many other programs. Therefore, it is very im portant to remember the definitions and the difference between windows and frames in the EMACS text editor. A buffer associated with a file contains a copy of the text on the hard drive. Thus, any modification in a buffer will not affect the original file until the buffer is saved. The methodology of a programming text editor is to open a program file in a text buffer, and allow a programmer to modify text in the buffer. If a user decides to close the buffer without saving, the original file is untouched. If the buffer is saved, the original file will be replaced and updated to match the buffer [10, pp. 551-578]). E m acs also provides built-in autom atic recovery and backup systems. The au­ tom atic recovery system generates files which start and end with “#”s (pound signs). Moreover, users can customize the back up strategy of the autom atic recovery sys­ tem. Normally, E m acs saves the buffer to a tem porary file with the same name as the file plus “#” signs at the begin and end (for example, te m p .tx t will be saved as “# te m p .tx t# ” ). The automatic recovery system can be triggered in many different 13 ways. For example, after a constant number of characters in a buffer have been modified (by default, 300 keystrokes), or a constant amount of time, EMACS will write the buffer to an automatic recovery file. On the other hand, the backup system generates a file which has the same name as the working file plus a tilde at the end of the file name. For example, “te m p .tx t” will be saved as “te m p .tx t" ” . Once the backup system saves the file, the file which was created by the autom atic recovery system is deleted. Therefore, even when E m acs does not crash, the backup system still saves backup files just in case users want to roll back to previous versions. For these reasons, EMACS is a very powerful and safe environment to create a program, edit reports, and do text modifications [10, pp.489-500]. The programming modes in the E m a c s text editor give programmers support for particular programming languages. For example, PASCAL-mode is one of the pro­ gramming modes supplied by the EMACS text editor. Programmers who use EMACS to edit their P a s c a l program would receive some help from E m a c s. Each program­ ming mode provides different support to programmers. Some of the programming modes provide an indentation feature, which re-formats and indents the contents of a program according to scope levels. Such powerful tools require a parser. Some of the programming modes provide compiling features which allow programmers to compile their code directly from the text editor. On the other hand, some of the programming modes only provide a syntax-based colouring feature which helps programmers to identify the types of tokens. Users can write their own mode in E m acs. For more details about programming modes, see the EMACS-LISP Manual [10, pp. 405-439]. Each character position in a buffer or a string can have a teod property list which may contain more than one text property. Each of properties is assigned to character at th a t particular position in buffer. For this reason, the ‘E ’ character at the beginning of the last sentence may not have the same text properties as the 14 ‘E ’ which position at the beginning of the paragraph. Each property has its own name and value, and it is acceptable for many characters to contain the same text property. Copying text between strings and buffers preserves the properties along with the characters. Similarly, moving characters also moves the text properties. Some examples of text properties are the colour or the font of text. The only way to remove properties from a character is to call a remove property function from E macs [10, pp.640-657] . Overlays [10, pp.766-772] are other ways th a t properties can be attached to a buffer. An overlay specifies properties th a t apply to part of a buffer. Each overlay applies to a specified range of a buffer, and contains a property list (a list whose elements are alternating property names and values). An example of an overlay is the text highlighting used for copy and paste in a text editor. Users can select and highlight a region of text, and do some operations upon the text such as copy, replace, delete, or cut. However, applying an overlay does not modify any text properties. In this thesis, the parser appends overlays onto ALDOR code in an E macs buffer. The parser overlays clearly indicate the beginning and ending of blocks, functions, variables, iteration functions, and other useful information. In th e term inology of operating systems, a process is a space in which a program can execute. E m acs runs in a process, and E m a c s -L IS P program s can invoke other program s in processes of th eir own. These are called sub-processes or child processes of a EMACS process, which is their parent process. A sub-process of E m acs may be synchronous or asynchronous, depending on how it was created. W ith a synchronous sub-process, a L IS P program waits for th e sub-process to te r­ m inate before continuing execution. However, an asynchronous sub-process can run in parallel w ith an EMACS-LISP program . This kind of sub-process is represented w ithin EMACS-LISP by an object which is also called a process [10, pp.733-754]. E m a cs -L IS P program s can use this object to com m unicate w ith a sub-process or 15 V ersions N am e Version A ldor E macs Version 1.0.2 and all previous versions Version 21.3.1 (i386-msvc-nt5.1.2600) of 200&3-27 on buffy Version 2.5.4 Version 1.875b Version 3.3.3 (i686-pc-cygwin) Microsoft Windows XP (Home Edition) Version 2002 - Service Pack 2 Flex Bison Makefile and gcc OS (Operating System) Table 1.4: The Version Details of the Programs Used to control it. I used an asynchronous process to construct the external version of the A l d o r lexical analyzer. I used a synchronous process for my A l d o r parser because a parser needs to look ahead and an asynchronous process m ay cause problems. A process sentinel [10, pp. 750-751] is a function th a t gets executed whenever the associated process changes statu s for any reason, including signals (w hether sent by E m acs or caused by th e process’s own actions) th a t term inate, stop, or continue the process. A process sentinel also gets executed if th e process exits. A sentinel runs only while EMACS is w aiting for term inal input, for tim e to elapse, or for process output. T he advantage of this design is to avoid synchronization errors th a t could result from running them at random places in th e middle of other L IS P program s. I used a process sentinel in my external version of th e A l d o r lexical analyzer to determ ine when it had finished. A process filter fu nction [10, pp.733-754] is a function th at receives standard output from the associated process. If a process has a filter, then all output from th a t process are passed to the hlter. The process buffer is used directly for output from a process only when there is no filter. A filter function can only be called when E macs is waiting for something, because process output arrives only at such times. Filters and sentinels are very similar in some cases. I used filter functions to restore 16 the reading point in a buffer to its original place (see Section 3.6). 1.6 V ersion D etails This ends the discussion of EMACS concepts. In next few chapters, we look at internal ALDOR lexical analyzers, external ALDOR lexical analyzers, an external A l d o r parser, and A l d o r mode for E m a c s . Table 1.4 are the versions and builds of the programming language, programs, and applications which I used to establish my work. 17 Chapter 2 The Internal Lexical Analyzer Every programming text editor requires a lexical analyzer. The function of lexical analysis is to produce tokens. Tokens which are generated by a lexical analyzer may not all have the same properties. In my research, my lexical analyzer returns token types according to the source text. A definition of a token is a basic, grammatically indivisible unit of a language such as a keyword, operator, numbers or identifier. For example, the type of the token “if” is a reserved word in ALDOR. Although some programming languages have only a few different kinds of tokens, there are many complicated programming languages th a t contain more than thirty kinds of tokens. In the following sections, there are some examples and explanation about the A l d o r tokens. Lexical analyzers, parsers, and text editors are inseparable. To achieve proper syntax-based colouring, a text editor needs the token types which are classified by a lexical analyzer. On the other hand, a parser requires the token types from a lexical analyzer to construct syntax trees for programs. Therefore, a lexical analyzer is one of the main components enabling a programming text editor to work. There are some programming languages which already are widely used in the computer programming world; for examples: 18 JAVA, BASIC, and PASCAL. For these programming languages, one can easily find many lexical analyzers to perform token classification. On the other hand, there are some programming languages such as A l d o r th a t do not have as many lexical analyzers available. Furthermore, there is not yet a text editor designed to support syntax-based colouring and token identification for the A l d o r programming language. Since ALDOR is not yet widely used, resource and tools which support the language are very limited. 2.1 O verview o f B u ild in g th e Internal Lexical A n ­ alyzer The first step to produce a lexical analyzer for A l d o r is to understand properties of a lexical analyzer. There are many lexical analysis algorithms available, which makes it possible to do comparisons between them. In my opinion, the most impor­ tan t properties of a lexical analyzer are correctness and efficiency. A lexical analyzer should be able to classify tokens in a buffer within a reasonable time. Moreover, high speed classification should not affect the correctness of the results. Addition­ ally, I demand my lexical analyzer to be understandable by anyone who wishes to implement a lexical analyzer based on mine. A lexical analyzer should provide max­ imum support to its users (programmers and text editors) with minimum learning requirements. E m acs already supports many different programming fanguages and presents many programming language modes. Hence, one can study the code which has been implemented within the EMACS text editor to try to understand how to implement an internal lexical analyzer. E m a cs -L IS P provides many built-in functions which help programmers write efficient programs for EMACS. Since it provides so many built-in functions, it is up to programmers to decide how to use the tools they have available. Sometimes, it 19 is very difficult to decide when and how to use them. There are many different kinds of lexical analyzer available, and most of them take different approaches and use different algorithms. Each approach has its own advantages and disadvantages. Thus, understanding the algorithms which imple­ ment lexical analyzers is a very im portant task. Finally, after discussion with my supervisor, I decided on the approach of identifying the ALDOR tokens by regular expression. After I completed the lexical analyzers, tests on the time and space usage were performed upon them to ensure the correctness and efficiency (see [19] about pro­ gramming languages and their memory usage). 2.2 M eth od My internal lexical analyzer is implemented in a file called a ld o r - i n t e r n a l. e l. The lexical analyzer provides functions to traverse a source buffer and add text properties to it based on tokenization information. My first version of an internal lexical analyzer reads from the source buffer and then writes the tokenized result into an intermediate buffer. Finally, it processes the information in the intermediate buffer and puts results back into the source buffer including colour highlighting and text properties. Nevertheless, I was not fully satisfied with the results from the first approach due to its memory space usage. For this reason, I eventually came up with an algorithm which works better and updates the source file “on the fly” . This approach improves my internal lexical analyzer speed by nearly a factor of two. This version of the lexical analyzer fetches a token, finds the type, then updates colour and text properties of the token directly. For this reason, there is no intermediate buffer involved. An intermediate buffer not only wastes resources (memory and space). 20 but also requires a lot of time to process. Since this algorithm is faster, I used it in my lexical analyzer. The combination of all improvements led to my final version of a lexical analyzer recognition algorithm. I first collected all the A ldor tokens and then grouped them into three sub-types which I will discuss later in this section. Finally, I developed optimized regular expressions for each token and wrapped the tokens with EMACS built-in functions to improve the speed of token recognition. There are two scanning functions in my program. One function called aldorscan, requests users to provide names of files th at need to be tokenized. The other function, called aldor-scan-b uffer, scans the currently selected EMACS buffer, runs a lexical analyzer, and applies the syntax-based colouring and text properties to the text buffer. Other components which make my lexical analyzer work are functions called a ld o r -f ind-token and aldor-colour-syntax. The a ld o r -fin d token function calls the defined regular expressions functions and tries to identify a type for a token. Once a token type has been found, a ld o r -f ind-token calls a ld or-colou r-syn tax and passes on the following information: the beginning and end position of the token; the type of the token; and the source buffer reference. The aldor-colou r-syntax function uses the information it has to insert desired colours and apply appropriate text properties to the token string. One thing to notice, a lexical analyzer changes how the information in the buffer is displayed, not the contents of the file. As a result, the source file will not be modified unless the user decides to save the buffer and overwrite the actual file. However, results of a lexical analyzer are not be saved. In other words, EMACS will not save any of the text properties when it saves a buffer. I use regular expressions to perform token recognition and handling. I also use an explicitly loaded function called rx from the EMACS package. This function will translate an S-expression (a balanced-parentheses expression) to a regular expression 21 (se tq aldor-reserved-word (rx (and word-start (or "and" "assert" "add" "always" [ . . . see Appendix A for a complete l i s t . . . ] "then" "return" "rule" "select" "while" "where" "to" "try" "yield") word-end))) "break "throw "with" Figure 2.1: A Regular Expression to Match A ldor Reserved Words string. I separated the tokens into many different categories. Then I wrote one regular expression for each token. The A ldor reserved words are the simplest category to implement because reserved words are defined as fixed strings. The E macs -LISP code in Eigure 2.1 defines the regular expression for the A ldor reserved word token strings. Function s e tq assigns the value in the second argument to the first argument. In my case, the first argument is an ALDOR reserved word type; and the second argument is the value which is returned by the rx function. The rule for reserved words is quite simple as it requires three elements to return a reserved word: a word-start token; a keyword declared within an “or” block; and a word-end token. Both word-start and word-end tokens are predefined by E macs (for details about these functions, please refer to the E macs -LISP manual [10, pp. 687-710]). There are many other types of tokens which are declared in a similar way to reserved word tokens. The following token types fall in this category: Reserved key­ words, Definable keywords, Class keywords. Import Function keywords. Future operator. Reserved operators, and Definable operators. The second kind of tokens are identifiers, integer literals, floating-point literals and any tokens which cannot be defined by a fixed list of strings. A regular expres- 22 (se tq a ld o r -s tr in g (rx (and "\"" (0+ (or (and (1+ space)) (and (not space)) (not (any "_\n\"" )))) ))) Figure 2.2: A Regular Expression to Match A ldor String Tokens sion is the perfect technique to handle these tokens. Among all token types without a fixed form, the String token is the simplest type to handle. Figure 2.2 contains E m acs-L IS P codes which defines the regular expression for such a token. An A ldor string is defined as a sequence of symbols which starts and ends with double quotes. Between these two double quotes any sequences of character, word, sentence, and space are allowed but double quotes and new line characters are unacceptable. Moreover, the underscore symbol can be displayed in an A ldor string as long as the following rules apply: 1) an more whitespace characters; 2) an is followed by one or is followed by exactly one non-white space character (possibly a double-quote), or 3) an is followed by any character other than an underscore, a double quote, or a newline character. Hence, it is legal for a string token to contain one or more underscore symbols. Among all token types without a fixed form, the floating-point literal token is the most complicated to define. Since ALDOR supports floating-point literals in many forms, the regular expression for such a type is not simple. Figure 2.3 includes a collection of intermediate regular expression which are required not only for floating­ point literal tokens, but also other numeric literal tokens. A ldor floating-point type tokens can only be represented by very complicated regular expression in EMACS-LISP code. After a few attem pts and discussion with my supervisor, I finally designed the E m acs-L IS P code which define A ldor a ld o r -f lo a t type tokens (see Figure 2.4). 23 ; ; Some b a sic and intermediate regular expressions th at are ;; used in other regular expressions more than once. e s c - d i g i t ’ (and (zero-or-one d ig it) lo n g - d ig it ’ (any "0-9A-Za-z") ; CAN be replace by " l e t t e r " . . . . e s c - lo n g - d ig i t ‘ (and (zero-or-one , lo n g - d ig it ) radix ‘ (and d i g i t (0+ , e s c - d i g i t ) "r") expon ‘ (and (any "eE") (zero-or-one (any "+-")) (1+ , e s c - d i g i t ) ) Figure 2.3: A Regular Expression to Match A ldor Intermediate Tokens I would also like to briefly explain an E macs -LISP symbol ‘ (back quote) and r x - t o - s t r in g . The function r x - t o - s t r in g parses and produces code for regular expression which are written in S-expression regular expression form. On the other hand, the back quote function translates everything in its matched parentheses to a list without evaluating the contents. For this reason, the expression which is defined by the back quote function can be substituted into other regular expressions by adding a comma in front of it. For more details about EMACS-LISP functions and symbols, please refer to the Regular Expression sections in the EMACS-LISP Manual [10, pp.687-710]. Additionally, for details about EMACS-LISP declarations for integer literals and other token types without fixed forms, please refer to the documentation and comments in the a ld o r - i n t e r n a l. e l file. The third kind of declaration category contains ALDOR Pre-Document, P ostDocument, and Comment token types. All three kinds of token types start with some repeated special symbols and end with a new line character. The Pre-Document token type is defined by any string after three plus (+++) symbols (see Figure 2.5). The new operators which I now introduce are repeat and group. A repeat operator takes two arguments, the number of repetitions of the symbol and the repeating symbol. A group operator groups the repeating symbols together. In 24 The fo llo w in g regular expression defined the f lo a t in g -p o in t type token: { d i g i t } * " . "{esc_digit}+{expon}?I { d i g i t } * " . "{esc_digit}*{expon}?I {digit}+{expon}I {r a d ix } {e sc _ lo n g _ d ig it} * " . "{esc_long_digit}+{expon}?I { r a d ix } {e sc _ lo n g _ d ig it} + " . "{esc_long_digit}*{expon}?| {rad ix}{esc_lon g_digit}+{expon } The fo llow in g i s the Emacs-LISP code fo r "float" (se tq a ld o r -f lo a t ( r x - t o - s t r in g ‘ (or (and (* d i g i t ) (+ , e s c - d i g i t ) (zero-or-one , expon)) (and d i g i t (* , e s c - d i g i t ) (* , e s c - d i g i t ) (zero-or-one , expon)) (and d i g i t (* , e s c - d i g i t ) , expon) (and , radix (or (and (* , e s c - lo n g - d ig i t ) (+ , e s c - lo n g - d ig i t ) (zero-or-more , expon)) (and (+ , e s c - lo n g - d ig i t ) It tl (* , e s c - lo n g - d ig i t ) (zero-or-more , expon)) (and (+ , e s c - lo n g - d ig i t ) , e x p o n )))))) Figure 2.4: A Regular Expression to Match A ldor Floating Point Literals Tokens (se tq aldor-pre-document (rx (and (group (repeat 3 "+")) (1+ n o t-n e w lin e )) ) ) Figure 2.5: A Regular Expression to Match A l d o r Pre-document Tokens 25 the case of the ALDOR Pre-Document token, three plus signs (+++) are grouped together. Post-Document and Comment tokens are implemented in a similar way. The difference is th a t Post-Document tokens starts with two plus signs instead of three; while Comment tokens starts with two minus signs (—). The above is summarized in Table 1.2. 2.3 Efficiency This section discusses the efficiency of the update “on the fly” approach. As com­ puter speed improves, modern computers are able to process the entire source buffer before a programmer implements a simple function in a buffer. Hence, the timing efficiency becomes a less serious threat for overall efficiency. For a small ALDOR program buffer, the space and time efficiencies are almost inconspicuous. Once I am dealing with a buffer which contains a huge A l d o r program, then space usage and time efficiency become very important. For the updating “on the fly” approach, extra memory usage is close to zero. It is very safe to declare th a t the space usage per line will always stay constant; and the size of a buffer will not affect the space efficiency at all. The next issue is the time usage. In this version of the internal A l d o r lexical analyzer, each token is updated “on the fly” . One of the greatest advantages of this approach is buffers get updated almost instantly. There is no time wasted in the transformation from an original buffer to a reorganized buffer. An internal ALDOR lexical analyzer which uses this approach updates two thousands lines of A l d o r codes within two seconds on my machine (Windows XP base machine, an AMD 24004- CPU, 512 MB of RAM, and no other applications running besides E m a c s ). In conclusion, the outcome for the time and space efficiency of my internal AL­ DOR lexical analyzer is acceptable. In Chapter 3, I include a data chart which 26 contains all th e tim ing for this approach. 2.4 C orrectness A lexical analyzer should at least correctly identify tokens in a buffer. I ran my internal lexical analyzer against many different ALDOR files. All ALDOR tokens turned out to be classified correctly. In other words, all tokens were identified as expected. In summary, I have completed a lexical analyzer which provides good perfor­ mance with acceptable resource usage. My requirements for the program were th a t it must not only serve the purpose of what it was designed for, but also th a t it uses all resources correctly. The final version of my internal lexical analyzer met all the requirements which I had. 2.5 P roblem s E ncountered and their Solu tion s In this section, I discuss problems which I encountered during the development of my internal lexical analyzer. There are two major problems which I would like to explain. The first problem is the priority problem and the second one is the token type identification problem. During the development of the internal lexical analyzers for ALDOR, I noticed the priority problem. The priority problem is due to bad design of token priority. For example, I would not wish to see any reserved word classified as an identifier. Since my program identifies tokens by regular expression, sometimes there is a tie situation. In this case, my program uses the first rule which was declared. In the example, if the identifier token type was declared before the reserved words token type; the token would be classified as an identifier. For this reason, an identifier 27 Type R eserved D efinable K eyword (]LASS Import F unction P re -D ocument P ost -D ocument C omment S tring Identifier F loat Integer D efinable O perator R eserved O perator F uture O perator P r io r ity B B B B A A A A E C D B B B Figure 2.6: A ldor Token Precedence token type should have the lowest priority among all token types. Although my lexical analyzer tries to match the longest possible sequence in a buffer, I still need to write multiple separate regular expressions to recognize the various kinds of tokens, and prioritize those regular expressions to ensure th a t they are tried in an order th a t guarantees matching the longest sequence. For this reason, I developed a priority chart for the ALDOR programming language (See Figure 2.6). The most critical problem which I encountered is the “0” and “1” symbol problem (token identification problem). According to A ldor syntax, 0 and 1 should be treated as identifiers, but their type may change depending on the surrounding context and on the effect of explicit or implicit im port statements. For this reason, both of them can possibly be an identifier, an integer or a floating-point number, a function name, a procedure, or a string (and any other type which is valid in A ldor ). A s a result, classification of the type of these two symbols becomes very complicated. To determine the exact token type for 0 and 1 would required a parser. In conclusion, a lexical analyzer has absolutely no way to find the type of 0 and 1; 28 finding their types is as difficult as parsing the program. For this reason, in both internal and external lexical analyzers, I have treated these two tokens as identifiers. There were m any problem s and bugs which I encountered while I developed my internal version of th e lexical analyzer for ALDOR. Most of them were resolved by doing more research, re-designing th e program , and asking other people. This section discussed some of th e common problem s which everyone may encounter during design tim e. Furtherm ore, this section may help other program m ers who wish to w rite lexical analyzers make decisions on when to use w hat approach. This section should help others not to make th e same mistakes which I m ade while I was developing th e program. 2.6 C onclusion and Sum m ary In conclusion, the internal version of the lexical analyzer which I designed for the programming language ALDOR worked very well. It is working faster than my original expectation and it is also very resource efficient. Although there are some minor details I would like to change, the outcomes should satisfy most people overall. Definitely, this lexical analyzer has set the foundation for my future research and aldor-mode components. Additionally, implementing a good lexical analyzer is a very im portant step before implementing parsers. For a programmer who wishes to build parsers for any of the programming languages, a good lexical analyzer will surely help them in their later work. Another im portant thing which I learned from this exercise is to think generically. I tried to make my lexical analyzer very flexible, understandable, and easy to modify. As a result, anyone who studies this thesis should be able to implement an internal lexical analyzer with syntax-based colouring and token identification features for any programming language. 29 Like lexical analyzers for other program m ing language modes, my internal lexical analyzer is im plem ented in E m a cs -L IS P and uses E m acs regular expressions. A lthough it is not much different from pre-existing lexical analyzers, it may be easier for others to read. This ends the discussion of my internal lexical analyzer. In the following chapter we look at an external lexical analyzer th a t communicates with EMACS via pipes. In Chapter 6 we compare the lexical analyzer presented in this chapter with the one presented in the next chapter. 30 Chapter 3 The External Lexical Analyzer In the last chapter, I described internal lexical analyzers for the programming lan­ guage A l d o r . In this chapter, I discuss external lexical analyzers. I believe th at independent programs are more generic and more flexible. I compare both internal and external lexical analyzers through comparisons of; time, memory, efficiency, resource usage, and usefulness. 3.1 O verview o f B u ild in g th e E xternal Lexical A n ­ alyzer As mentioned, I worked on both external and internal lexical analyzers simultane­ ously. For this reason, I learned a lot from both versions during development. One of the advantages to this strategy is th a t if I could apply the same logic to both lexical analyzers, then I would be able to compare the differences in results. From all of these results, I eventually found the best method for the internal and external version of the program. The primary challenge with an external lexical analyzer is program-to-program communication. This research also compares the internal and external lexical ana31 lyzers for efficiency and ease of implementation. For this reason, establishing correct and efficient communication is one of the most im portant components for this por­ tion of the research. For two people to talk to each other, they must first agree to use the same language to talk. Otherwise, they will not be able to understand each other. Even if they do not share the same language, they must have some common form of communication in order to be able to exchange words, thoughts, or ideas with each other. For example, Japanese and Chinese share some of the same sets of characters. Hence, it is possible for people from these two nations to communicate with each other. Some other ways of communicating are sign language, drawings, or simply finding a translator. Similarly, for two or more programs to communicate with each other, there must be some way for them to talk or communicate their needs. In the case of my external lexical analyzer, there are pipes (intermediate media) which are created by EMACS in order to make communication possible. Once an intermediate medium has been established, the remaining issue is for those programs to speak the same language. Just like with humans, the only way for computers or programs to interact with each other, is to exchange information in the same language. The language th a t my external lexical analyzer uses to communicate with E m a c s consists of fists of L IS P expressions th a t contain three arguments. The arguments of the expressions are a beginning point, an end point, and the type of an identified token. To fetch a list of three-tuples from an intermediate buffer, I implemented the functions a ld o r -r e a d -fr o m -list and a ld o r -g e t- l-v a lu e . The latter function r e tu r n s one th re e -tu p le L I S P e x p re ssio n to t h e a ld o r - r e a d - f r o m - lis t function. The information is then used to complete all remaining tasks of the lexical analyzer. Since EMACS is implemented in E m a c s - L I S P , I believe it would more be efficient to process information in a list of partial E m a c s - L I S P code. 32 Unfortunately, sharing the same language is not enough to establish communi­ cation. At a later stage of my research, I realized the timing is also very im portant. For example, suppose there are two people who want to meet in a park. For one reason or another, one person is late. As a result, the other person may assume th at his friend will not be coming, forgot or such, and decides th a t he should go home. Hence, the meeting of these two people is cancelled. Similarly, I also ran into the same problem with my analyzer: an EMACS-LISP program process term inated and assumed scanning was finished while an external lexical analyzer was still scanning the source file. Since the program is running outside of EMACS, it needs time to scan the source buffer and to generate the information which can be read by the E macs program, and only then could it give all the information to the EMACS program. The process function of EMACS sends a request to execute an external lexical analyzer and waits for results to be returned. How long should EMACS wait for results? How long will it take for a lexical analyzer to return the results? W hat is the best and most efficient way to handle this communication and waiting problem? All of the above questions presented challenges in completing this external lexical analyzer. Thus, the research I did for this version of the lexical analyzer is heavily concentrated on these problems. One thing I realized was th a t all methods which I applied in an external version also applied to the internal version, whereas the methods which I applied to the internal version often could not be applied to the external version. The reason for this is th a t there are more limitations on an external version of an analyzer. The challenge of establishing a tokenizing method for an external lexical analyzer is not the implementation; it is primarily the problems of communication and timing. 33 3.2 M eth od The key for building internal lexical analyzers for E macs is to use resources effi­ ciently. Internal versions of programming language lexical analyzers are heavily fo­ cused on the combination of function efficiency and optimized resource usage. Since all information is provided inside E macs itself; I was able to focus on the optimiza­ tion of the lexical analyzers’ performance. Development of an external analyzer for E macs is a completely different experience. An external analyzer for a particular programming language fetches the contents of an EMACS buffer and processes the information it gets. W hen an external analyzer finishes its tokenizing process, it sends results back to E macs in a format which can be understood and processed by E macs . Once E macs has received this information, it is able to complete the scanning job, ultimately returning or displaying the appropriate result in its buffer. Obviously, the first challenge which I came across is transferring information. When I did my research, I found a few lexical analyzer programs such as Python and Flex. For these programs, programmers provide tokenizing rules, and these programs will generate results as set of tokens. Among all of these programs, I chose a tokenizer language called Flex (see [16], [17] for more details). For complete information on Flex, please refer to the official Flex website, or consult the man pages provided by UNIX. Here are a few reasons for using Flex to implement my lexical analyzer. Firstly, it is very well documented. For this reason, the logic and techniques used by Flex to do the tokenizing process were understandable. There are other programs similar to Flex which do a comparable job; but, do not have documentation as good as th a t of Flex. Secondly, 1 tested a variety of these programs, and Flex was one of the few programs which was able to get the job done in a reasonable amount of time with accurate results. Since 1 tried to make the entire program work efficiently — its components being the external lexical analyzer for 34 A l d o r and E m a c s — the running time of my external analyzer is very im portant. Thirdly, the implementation required for F l e x is fairly simple. It is very close to C in programming style, a language I am proficient in. This made it very easy to learn and then write a good rule tokenizing program. For these reasons, I decided to use F l e x as my external end of the pipe. One of drawbacks which I soon become conscious of is the lack of flexibility of a lexical analyzer generated by F le x . In general, a generic program such as F l e x is less powerful than a program created to meet a specific need, such as for a specific language. Clearly, when I write a program, I can customize results th a t I want, along with all the information th at I need. Moreover, I am able to modify and shape a program to suit my needs. Due to the limitation of F le x , I spent lots of time on researching the syntax, logic, and implementation of F le x . Instead of letting F le x generates results in the format which I wish for, I had to make EMACS be able to understand the information which was created by F le x . At th a t point in time, my main challenge was to attain communication between EMACS and F le x . Somehow, I had to develop a middle stage such th a t both ends of the pipe would be able to talk to, and understand each other without any mistranslations or misunderstandings. The implementation of the ALDOR lexical analyzer using F l e x did not take too much time. Instead, it was the process of getting tokenizing rules th a t proved to be more difficult. Although I had already completed rules for the internal version of the lexical analyzer, the trail which led to the completion of an external lexical analyzer was still cloudy. F l e x has its own rules for regular expressions and, moreover, the implementations for token recognition rules are very different between EMACS, an in te rn a l version; a n d F le x , th e e x te r n a l v ersio n . B asically, EMACS a n d F le x h av e different representations of regular expressions. All token rules needed some minor adjustment due to this difference. As a result, I spent many hours testing the correctness of F l e x defined tokens, and correcting the errors. 35 The first set of rules which I put into operation were some basic intermediate regular expression (Figure 3.1). Additionally, Flex allows me to define the actions to take when a token is found. Figure 3.2 contains some examples of how to define actions for Flex when a token has been found or classified. By this example, one can easily observe th at actions have been defined as a token identification followed by a function to handle the token. These actions are w ritten in the format used by C. Earlier I covered all the different token cases and actions which are taken when some token has been classified. At the beginning of my example, when a “ws” (white space) symbol has been identified, my program then increases variable in it_ p (defined by me) by one. If a “n l ” (new line) token has been classified, the program should takes no action with regards to the in it_ p variable. If a reserved word or constant symbol string token is found, the program then returns results to the program which called this Flex program. Fortunately, F lex gives programmers flexibility to not only return results or complete a single action, programmers can actually run a series of actions inside an action block. For this reason, I can call pre-defined functions of C, such as strn c p y or user defined functions, like p r i n t _ l i s p . From this, a programmer is able to generate results in a preferred format by simply writing format functions. The last piece of F lex is the driver program which is once again implemented in C. After a few different attem pts and improvements I implemented my final version with a new function, p r i n t _ l i s p , which prints the results to a file with a format th a t E macs -L IS P is able to understand. In the final version, I cleaned up most of the extra statem ents and optimized the token classification rules for ALDOR. Therefore, the running time for my external lexical analyzer is very reasonable. The information which is provided by the p r i n t _ l i s p function is: the starting position, ending position, and type of a token. Notice th a t the function p r i n t _ l i s p returns types of tokens instead of the actual token strings. The E macs text editor does not 36 esc _ alpha one_digit d ig it e sc .d ig it l o n g .d ig it esc_ lo n g _ d ig it radix in t expon flo a t [A-Za-z°/o] [2-9] [0-9] {d ig it}I{e sc } {d ig it}I[A -Z ] { lo n g _ d ig it } I { e s c } { d ig it} { e sc _ d ig it} * " r " { o n e _ d ig it } I { d ig it} { e s c _ d ig it } + I { r a d ix } { e s c _ lo n g _ d ig it } + [eE ][+ -]? {es c _ d ig it} + { d i g i t } * " . "{esc_digit}+{expon}?I { d ig it } + " . "{esc_digit}*{expon}?I {digit}+{expon}I { r a d ix } {e sc _ lo n g _ d ig it} * " . "{esc_long_digit}+{expon}?I {ra d ix } {e sc _ lo n g _ d ig it} + " . "{esc_long_digit}*{expon}?I {rad ix}{esc_lon g_digit}+{expon } dot future_word future_op definable_op_word definable_op "d ela y" I " f i x " I " i s " I " i s n t " I " l e t " I "rule" [ \ [ { ( ] "I "I" I" [ \ ] } ) ] I [' &] I " I I " "by" I " ca se" I "mod"I "quo" I "rem" [ \ - \ \ # * + or <=>@~\"] I " « " I " » " |"<=" |">=" I "<-" ">"I " \ \ or " I" or \ \ " I "**"I" I t t l I t I I - ' — I t id "0"I"1"I { a l p h a } ( { a lp h a } |{ d ig it } I [!? I] ) * quote ["‘ "] string_char ["_‘ ‘ "\n]I esc[~ \n ] ws \ t I " " n l \n["\015"] pre_doc "+++".* post_doc "++".* comment "— ".* s tr in g \" [~ \" ]* \" Figure 3.1: Definitions for Some ALDOR Tokenizer Components 37 ++init_p; {ws} { n l} add always and a sse r t break but { p r in t _ lis p ( { p r in t _ lis p ( { p r in t _ lis p ( { p r in t _ lis p ( { p r in t _ lis p ( { p r in t _ lis p ( in it _ p . in it_ p . in it_ p . in it.p . in it_ p . in it _ p . yyleng. yyleng. yyleng. yyleng. yyleng. yyleng. "aldor..reserved", "aldor..reserved", "aldor..reserved", "aldor..reserved", "aldor..reserved", "aldor..reserved", "ADD" ) ; } "ALWAYS" ) ; } "AND" ) ; } "ASSERT" ) ; } "BREAK" ) ; } "BUT" ) ; } ( l i n e s omitted) stdoutI newline Ispace I include I rep Iper IPer \ { strncpy(tok en String, y y t e x t ,2 0); p r i n t _ lis p ( in it _ p , yyleng, "aldor_import_function", y y te x t); \} /* pre_doc */ {pre.doc} \ { s tr n c p y (to k e n S tr in g ,y y te x t,20); p r in t _ lis p ( in it _ p , yyleng, "aldor_pre_doc", y y te x t); \} ( li n e s omitted) It J . I I II II { p r i n t _ lis p ( in it _ p , yyleng, { p r in t _ lis p ( in it _ p , yyleng, { p r i n t _ lis p ( in it _ p , yyleng. /* Future ops */ "LI" { p r in t _ lis p ( " { I" { p r in t _ lis p ( "(I" { p r i n t _ lis p ( "11" { p r in t _ lis p ( in it _ p , yyleng, in it _ p , yyleng, in it _ p , yyleng, in it _ p , yyleng. ’aldor_res_op", 'RBRACE" ) ; } 'aldor_res_op", "VERY_EQUAL" ) ; } 'aldor_res_op", 'BAR" ) ; } 'aldor_future_op", 'aldor_future_op", 'aldor_future_op", 'aldor_future_op", Figure 3.2: Actions to Take W hen A ldor Finds a Token 38 'LLBRACK" ) ; } 'LLBRACE" ) ; } 'LLPAREN" ) ; } 'RRBRACK" ) ; } need to know the actual token; it is more helpful to know the type of the token. Once I finished my study on functionality provided by Flex, I implemented a function called a ld o r-s c a n in EMACS, which takes one argument, the name (or path) of a target file, sets all the required variables, then starts the communication with my external lexical analyzer and generates the result. I then used an interme­ diate buffer, which allowed the two programs which I created to communicate with each other. In other words, an intermediate buffer acts as the bridge connecting both sides of the programs. This intermediate buffer called “m id d le .tx t” (Note: the name of this buffer is customizable, and can be any name programmers wish it to be). Once I secure the connection, I also need a variable which will store and call the procedure. In my case, the procedure is the external lexical analyzer for ALDOR. Finally, I have to let my a ld o r-s c a n function know where to output results. For a lexical analyzer, putting results back into the source buffer is the most direct and efficient way to notify users what changes the lexical analyzer made. For this reason, the result buffer is same as the source buffer. One of the mistakes I made in an earlier stage was missing a file existence check. Sometimes, users may enter invalid paths or file names; as a result, the whole program will crash due to being unable to locate the source file and update the information back to the buffer. Fortunately, EMACS has autom atic file creation abilities. If a file does not exist, E m a c s will create a new empty buffer named as the user requested. To fix the file non-existence error, I added error checking. Since E m a c s already h a s a built-in function to handle error checking, it was not too hard for me to implement. All it took was some thinking to consider all the possible p ro b lem s. Once I initialize all the required components, I then call my external lexical analyzer from the a ld o r-s c a n function. I also have another function called a ld o r r e a d - f r o m - lis t to read results from the intermediate buffer. Coming to a comple­ 39 tion, the a ld o r - r e a d - f r o m - l i s t procedure calls the a ld o r - c o lo u r - s y n t a x proce­ dure to add additional properties to each token. These properties are associated with buffer text itself. Therefore, if text is deleted, then all the properties are also deleted. Additionally, if text gets moved to another location, properties will also move to the other location. One of the biggest advantages of associating the prop­ erties with text is th a t it gives us additional information. In my case, users are not only able to identify the types of tokens by colour highlighting, but they are also able to see the types of tokens displayed in the “echo” area of the EMACS text editor as they scroll their cursor over tokens. Furthermore, programmers can always add in more properties to token text to help them implement more advanced features. A fter deciding on th e algorithm and im plem enting my external lexical analyzer, I then did a fair am ount of tu n e up and problem solving on my external version of th e A l d o r lexical analyzer. 3.3 C om m unication or P ip es As mentioned in an earlier chapter, the key for program communication is to unify language and synchronize the timing. If programs are able to agree on one protocol and use the same format then mutual understanding between programs is really not th a t problematic. In my research, I used a lists of three-argument E macs L IS P expressions as the intermediate language between an external lexical analyzer and the EMACS text editor. My external lexical analyzer works as follows: the E macs text editor first sends its request to an external program and expects to learn something from it. Secondly, the external program generates results in an intermediate language and feeds those results to an intermediate buffer which was initially created by the EMACS text editor. Finally, the EMACS text editor takes the results from the intermediate buffer and learns from the results. From this algorithm 40 we can see th a t the most efficient implementation will have the external program generate results in E m a cs -L IS P code (or some code which can be evaluated by E m a c s ) and have E m acs compile code from the intermediate buffer. To summarize, the intermediate language is code in E m a cs -L IS P . Therefore, E m a cs can simply evaluate results one-by-one from an intermediate buffer. Please refer to Appendix E for the UML diagram of the lexical analyzer. On the other hand, synchronizing the timing is also very im portant. In the computer world, as long as communication is involved in a given application, then timing is an im portant issue. For example; in any network which requires secure communication, an uncoordinated communication, or timing problem may result in an error, duplicated data, or transmission problems. In parallel computing and distributed systems, timing is also a very im portant component. In this thesis, I have a waiting time of 0.1 seconds for my EMACS scanning program. Therefore, results are updated within a reasonable time. The program basically looks into an intermediate buffer for token information. If the intermediate buffer is empty, then it waits for one tenth of a second (0.1s), and attem pts to read from the intermediate buffer again. Therefore, eventually, it will be able to identify entire tokens in a source file and associate all desired properties to tokens. 3.4 Efficiency I performed my tests on a Windows XP based machine, with an AMD 2400+ CPU and 512 MB of RAM. I used an internal E m acs elapsed time function to time the processing time of my programs. During the timing process, since I wanted to get the best and most accurate result. I did not run any other applications other than E m a c s . I performed the same test three times, and averaged the results. However, the timing for all three times is very close. For this reason, I did not perform any 41 more tests of timing. I only performed my tests on one machine. In my concern, the main focus was to do the timing comparison between the internal and external lexical analyzers. As a result, it is pointless to perform tests with a faster or slower machine. Overall, I will get the same comparison result on their timing ratio. I timed the processing time for both internal and external lexical analyzers. Figure 3.3 is the data and time table for internal lexical analyzer versus external lexical analyzer. From the figure, it is easy to observe th a t an internal lexical analyzer runs faster than an external lexical analyzer. 3.5 C orrectness There might be some problem which will affect the correctness of the lexical ana­ lyzer. For instance, the longest sequence matching rule should always be applied to the lexical analyzer. If I do not follows the longest sequence matching rule, a token scanned later may overwrite a previously classified token. The basic idea is correct; an algorithm proposed is an over-write algorithm to avoid some problematic situa­ tions. For instance, in a Pre-Document token, there should not be any other token type in it. Since “fo r ” is one of the reserved words for ALDOR, it would be wrong if such text string gets identified as a reserved word inside a Pre-Document token. On the other hand, the overwrite algorithm also raises a problem. For example, suppose we have following string: +++ I love ++, and — l i k e s to program. Assume the token scanning order is Pre-Document, Post-Document, and then Comment. As a result, I get a Comment token insides a Post-Document token; and a troublesome Post-Document token inside a Pre-Document token. Therefore, the scanning process is not correct. 42 s s §111 I I V U I g% e C I? il! # 8 8 8 3 9 8 C, *= ('oes} eiiiii Bu|uun%j Figure 3.3: Timings of the Internal and External Lexical Analyzers My scanning algorithm implemented in F lex seems to be correct for the cases it is designed to handle. This version of the lexical analyzer tries to match up a token by the longest match rule. Hence, there will be no longest sequence match problem for my lexical analyzer. For example, it handles the above example string with the longest match token type it can classify; therefore, such a string will be classify as 43 a Pre-Document token — which is what I expect. In conclusion, correctness is more im portant than time efficiency. Generally, any lexical analyzers should sacrifice its efficiency for a correct result. Overall, it is more im portant for a program to do its job right and serve its purpose. 3.6 P roblem s E ncountered and their Solutions Similar to most of my experiments, the very first version of my external lexical analyzer failed dreadfully. It was a total disaster as the outcome not only crashed my computer, but also modified my entire program file containing all of the lexical analyzer functions. I quickly noticed th a t I had not set my path correctly. As a result, instead of modifying a source buffer, the program ends up modified itself. Luckily, I always keep a back-up copy of all of my work. Thus, I did not use too much time to recover my lost work. Eventually, I realized the problems I encountered with the buffer were not the only problems. I also encountered difficulties with my intermediate buffer. An intermediate buffer should be cleared every time before its use. Otherwise, an in­ termediate buffer will have a lot of extra useless information. Worst of all, some of the extra information may lead to a wrong result or even crash the program. Mismatched information may also lead E m a c s to update a source buffer with infor­ mation th a t is meant for another buffer. Furthermore, if the other buffer is longer than the source buffer, then EMACS will try to put those erroneous properties be­ yond the end of the source buffer which then causes termination of the program. E m a c s has to do all of its work according to information which is provided by an external lexical analyzer. Therefore, if the token information is not correct, then E m a c s will not be able to do its job accurately. The solution which I used was to delete the intermediate buffer as soon as the 44 entire source buffer has been scanned and all properties have been associated. In my program, there is no file actually being created; as per the discussion in the introduction, a buffer is different from a file. I used a buffer which is generated by E macs to store all information which is then returned by an external lexical analyzer. At the end of my program, I removed the buffer. Hence, the cleanliness of the intermediate buffer has been assured. The problems which I encountered above were very simple compared to the timing and communication problems which I met later. The major challenge when it comes to the timing and communication problems is to nail down actual problems. Any mistake could result in program term ination, a blank result, or a faulty result. Of all the work I did, I spent the m ajority of my time debugging problems, learning of potential problems, and trying to find a solution to fix those problems. Initially, when I ran my program and nothing happened, there might be many pos­ sible causes. Moreover, there were many places which may have problems. For this reason, it was very hard to find out where the problem is and what caused it in the first place. I used a debugger in an attem pt to find problems in a systematic process; since there are no other reasonable ways to nail down the problem. Although such a process takes a long time to make any progress, my hard work eventually paid off. I finally found the first bug — a communication bug. Basically, my external lexical analyzer did not send results to an intermediate buffer. As a result, the interme­ diate buffer was empty, and E m a c s was unable to do any work on the source file. The solution which I found is to use another function called s t a r t - p r o c e s s . I also created variables to hold all of the required information. Please refer to Appendix E for the UML diagram for the external lexical analyzer. Once I figured out th a t problem, and changed my approach, the intermediate buffer writing problem was solved. I ensured th a t all of the results got written to an intermediate buffer in the format which I requested, a format which could 45 be understood by EMACS. However, even though results had been written to an intermediate buffer, my program still did not work. I found out th a t there was a series of problems in my program — all of which led to failures of my program. I spent a few months discovering and correcting these problems. The most critical problem which I met was not being able to know what went wrong. As mentioned before, the same error may be due to many different reasons in many different places. During the debugging period, I modified a portion of correct code countless times. I encountered several different problems at the same time, and I realized th a t in order to make the whole program work, I must fix every single problem in the program all at once. As a result, everything which I previously implemented for my program seemed to be wrong. I had changed many instances of correct logic to wrong logic, and my program yielded even worse results. I organized problems into two major categories: timing problems and non-timing problems. As mentioned above, timing is very im portant when programs attem pt to communicate with each other. In the computer world, every micro-second may lead to a completely different result. Therefore, I spent a lot of time arranging the order of function calls as well as working on timing — trying to connect results with time very neatly. During communications, the major problem is a waiting problem. A waiting problem is also known as an EMACS timing problem. Such a problem is caused by the uncertainty of total waiting time required by the external lexical analyzer to read from the source buffer and write results to an intermediate buffer. As mentioned in the last paragraph of Section 3 .3 ,1 solved the problem by adding in the waiting time of one tenth of a second (0.1s). Therefore, I can always assume the programs at both ends of the pipe are running and the results are always processed by E m acs and generated by the external lexical analyzer. As soon as I fixed th e waiting problem a new problem arose. I ran into another w aiting problem — a w aiting for th e result problem. Since EMACS has a one te n th of 46 (defun a ld o r -p ro c e ss-d o n e -se n tin e l (process s tr in g ) (when (memq (p r o c e ss-sta tu s process) ’ ( e x it s ig n a l clo sed n i l ) ) (se tq aldor-scanner-done t ) ) ) Figure 3.4: The a ld o r -p ro c e ss-d o n e -se n tin e l Function a second (0.1s) waiting time to fix the read or write problem caused by an external lexical analyzer; it is always waiting for new results. Even though my external lexical analyzer had already finished and term inated, EMACS still waited for more results from my external lexical analyzer. This problem also took me a long time to figure out. Once the problem was identified, it was very easy to provide a solution. The solution I came up with is to check the status of my external lexical analyzer procedure (see Figure 3.4). The sentinel function which I wrote is called a ld o r -p ro c e ss-d o n e -se n tin e l. The sentinel tracks the status of the process (external lexical analyzer). This sen­ tinel is triggered if the state of the process changes. If the status changes to any of following states — e x it, sig n a l, c lo se , or n il; then it sets the variable the aldor-scanner-done to t (true). The aldor-scanner-done variable has boolean type. The purpose of this variable is to determine whether my external lexical analyzer procedure has completed, exited or is in any of the term inating states. Another critical problem which I encountered is the era se-b u ffer issue. This problem is caused by an intermediate buffer having been deleted or removed be­ fore my external lexical analyzer starts to scan the source buffer. For this reason, once my external lexical analyzer finishes scanning the source buffer, it cannot find an intermediate buffer to write results to. I overcame the e ra se -b u ffe r problem by re-implementing my entire program, changing the order of function calls, and introducing the a ld o r -p ro c e ss-d o n e -se n tin e l function which I discussed above. One of the side effects of a timing problem is the delay problem. A delay problem is caused by bad timing in E m a c s . The function which reads from an intermediate 47 buffer is called a ld o r -r e a d -fr o m -lis t. The problem was th a t my program termi­ nated before the a ld o r -r e a d -fr o m -list function was called. In other words, the E macs procedure finished my program before an external lexical analyzer wrote its result to an intermediate buffer. Fortunately, the a ld o r -p ro c e ss-d o n e -se n tin e l function also fixed this problem. My a ld o r -p ro c e ss-d o n e -se n tin e l function en­ sures th at the E macs program and a ld o r -r e a d -fro m -list function do not term i­ nate before the procedure (external lexical analyzer) is complete. On the other hand, there are also many other non-timing problems which re­ sulted in my program not working properly. One of the most basic and im portant observations is th a t some commands in EMACS have side effects. Therefore, it is very im portant for E macs -LISP programmers to know side effects for each action they take. Any E macs -LISP programmer who does not understand the side effects, would likely have bad result. If E macs-LISP programmers comprehend the side ef­ fects of functions, then the side effects will certainly become very powerful features. For example, inserting a string into a buffer is not simply just inserting a string into a buffer; it also has the side effect of moving point which stores the current position or location of the text cursor. Furthermore, an insert function also has a side effect on point-max. This function returns the last text position or location of a buffer which is also known as the end of the buffer position. As another example, the function called save-excursion is one of the functions which has the side effect of restoring states of a buffer. The most critical problem which I encountered of non-timing problems is a “point” problem. A “point” problem occurs when point, the current text cursor location, has not been set properly. There are many different variations of problems which arise from the point problem. For example, one of the problems I encountered has “point” passed point-max (the end of a buffer); as a result, the program either runs an infinite loop, as it is not able to reach the end of the buffer; or returns a 48 (defun a l d o r - r e s t o r e - p o i n t - f i l t e r (proc str in g ) (w ith-current-bu ffer (p ro cess-b u ffer proc) (save-excursion (goto-char (process-mark proc)) ( in s e r t str in g ) (set-marker (process-mark proc) ( p o i n t ) ) ) ) ) Figure 3.5: The a ld o r -r e s t o r e -p o in t-f i l t e r Function program exception error. For another example, during the waiting time (100 ms) for EMACS, it is the only chance and opportunity to read the information from an intermediate buffer (similar to the Reader and W riter problem). However, once my external lexical analyzer resumes its work, it will then continue to write information into an intermediate buffer until either the entire source buffer has been scanned, or 100 milli-seconds waiting period expires again. Regretfully, as my external lexical analyzer writes results into an intermediate buffer, it has the side affect of moving point, the current cursor location to another position, causing E macs to not know where it should resume its information reading. Additionally EMACS cannot start from a new point; otherwise, it will result in a section of code without desired properties. The solution which I developed is to add in a filter which would restore point to its original place — reset it to exactly where it was located before new in­ formation was inserted. save-excursion. The function which makes this filter possible is called A save-excursion function remembers the original state be­ fore the codes inside its scope has been executed. As a side effect of insertion, the s ta te , point, and max-point are modified. Figure 3.5 is the actual implementation for the filter. The logic for this function is very simple. Firstly, it sets the current work­ ing buffer to the intermediate buffer. As mentioned, setting the working buffer correctly is very important; otherwise, it may end up modifying my programming 49 buffer or source file instead of an intermediate buffer. The second step includes using a save-excursion function to store the original state before any changes have been made. Thirdly, the program moves the current working text cursor to wherever process-mark of the process is currently located. Usually the value of process-mark is located at the end of buffer. This function then inserts result strings into an intermediate buffer. As mentioned, the side effect of insert function moves point. As a final step, the function sets process-marker to wherever point is. After exiting the filter, save-excursion function restores the point to whatever position it was in before entering the filter. Therefore, point will remain correctly positioned. Some of the functions which I used to solve problems include a ld o r -c o lo u rsyntax function and a ld o r -g e t- l-v a lu e function. The ald o r-colo u r-sy n ta x function solved the problem for properties which had not been inserted properly. Additionally, I made many different attem pts to improve the efficiency for this func­ tion. In the end, I decided to use an EMACS built-in function, pu t-tex t-p ro p erty , to implement this function. There are two versions of the a ld o r -g e t- l-v a lu e function. One of my ap­ proaches passes a marker as the argument; the other approach, which I used for my final version, passes a buffer (intermediate buffer) as the argument. A success­ ful a ld o r -g e t- l-v a lu e was implemented after several failures. The logic for this function can be broken down into many smaller pieces (see Figure 3.6). Initially, the a ld o r - g e t - l- v a lu e function sets variable value to an intermediate buffer or nil according to the state of the intermediate buffer. Next, this function initializes the done variable to n i l . After which, this function loops through the intermediate buffer until my external lexical analyzer has completed and no more values can be read from the intermediate buffer. At this point, this function sets the variable done to the value of either aldor-scanner-done which was set within 50 (defun a ld o r - g e t - l- v a lu e (buffer) ( l e t ((value (c o n d itio n -ca se n i l (read buffer) (error n i l ) ) ) (done n i l ) ) (while (and (not value) (not done)) (se tq done (or aldor-scanner-done (n u ll (accept-process-output n i l 1)) ( s i t - f o r 0 .1 ) )) (se tq value (c o n d itio n -ca se n i l (read buffer) (error n i l ) ) ) ) v a lu e )) Figure 3.6: The a ld o r - g e t - l- v a lu e Function the a ld o r -p ro c e ss-d o n e -se n tin e l function or the value returned by the built-in function s i t - f o r (returning true means th at at least a portion of the input has been read); or to the negation of the function accept-process-output. Finally, the function sets the variable value to either the result from the read buffer, or nil if there were any errors. In conclusion, the sentinel and filter functions are the keys to solving most of my timing as well as non-timing problems; without them, this lexical analyzer would not work. A sentinel notifies a program of a change of status of the process. Therefore, a program is able to know when a scanning process is complete, and whether or not it is an appropriate time to read from an intermediate buffer. On the other hand, a filter function helps to set the point properly for programs, which solves the most critical non-timing problem, the “point” problem. Fixing problems actually took less time than figuring out the location and cause of the problems. Overall, the most challenging part of my whole research is a combination of timing and communication 51 problems. 3.7 C onclusion and Sum m ary In conclusion, this part of the program has been the longest and most challeng­ ing portion of my entire graduate term. Nonetheless, the purpose of my research is mainly based on computer communication and EMACS learning from outside sources. Since I encountered so many different tribulations, I gained great experi­ ence in the area of debugging communication problems and building techniques to solve these bugs. Moreover, I learned how to handle the timing problem and was actually able to think through most of the possible solutions. The outcome of my work is extremely gratifying. The external lexical analyzer for A ldor is working smoothly. However, the most im portant success of this work is not the completion of an external lexical analyzer; instead, it is how to connect the communication components between programs. Please refer to Appendix E for the UML diagram of my external lexical analyzer. This chapter discussed external lexical analyzers, and their concepts. The next chapter describes development of an external parser. My external parser uses the F lex language as well. However, the implementations and involved components are completely different. 52 Chapter 4 The Aldor Parser For a programming text editors, a parser is one of the most im portant components. If a lexical analyzer is the heart of a programming text editor; then a parser is the arms and legs. In other words, without a lexical analyzer, it is impossible to implement a programming text editor with any tools. It is very easy to construct a text editor; however, it is not as easy to create a programming text editor. The major difference between these two types of text editors is th a t a programming text editor actually has to know the programming language in order to be able to help a programmer. On the other hand, a plain text editor does not require any information of its contents. Therefore, a plain text editor will not be as much help to programmers. Nevertheless, there are some limitations to a lexical analyzer. A lexical analyzer has the ability to classify tokens. Furthermore, some of lexical analyzers have the capacity to present an indentation feature for programming languages. However, it will never be able to provide some useful features which a parser can provide. For instance, a lexical analyzer cannot help programmers to identify syntax errors; nor can it help programmers to find the start and end locations of a function, type of a variable, loops, or scope bounds. 53 For these reasons, it is possible to complete programming text editors without implementing a parser (just as people can survive without their legs and arms); however, with additional information supplied by a parser, a programming text editor will become more powerful and have the ability to help programmers to code their programs (with arms and legs, people can do more). 4.1 O verview o f B u ild in g th e Parser As discussed in the introductory chapter, A l d o r has a reasonable number of gram­ mar rules. This was one reason I decided to use ALDOR as my target programming language. One of main challenges I face is an ambiguous grammar problem. There will be more discussion of this topic in a later section of this chapter. Moreover, it is very difficult to find more information related to A l d o r since this programming language is relatively young, and not as popular as other widely used programming languages. One of the main objectives which I tried to reach is separating a processing text editor from its reference resource. Any ordinary text editor will have a chance to work for particular programming languages with the minimum modifications. For this reason, 1 decided to implement my parser externally. Similar to my external lexical analyzer for ALDOR, the communication between a parser and E m acs is one of the major challenges. The communication which involves a parser is more complicated than an external lexical analyzer. For an external parser to work, it requires three-way communication between the EMACS text editor, an external ALDOR lexical analyzer, and an external ALDOR parser. 54 4.2 M eth od There are many references on the Internet th a t provide information on programming language parsers. The implementations are very different between a parser and a lexical analyzer (see [8], [14]). Parsers requires much more information about programming languages than lexical analyzers need. One of the reasons for this is th a t for every parser, there must be a lexical analyzer to provide token information. As a result, it is very im portant to understand communications between a parser and its lexical analyzer. I did my external lexical analyzer for A l d o r in the programming language called Flex. Hence, I mainly concentrated research on parser programming languages which matches Flex tokenizer programming language. Since Flex is a very popular tokenizer generation programming language, it is quite easy to find a parser language which will be able to communicate with Flex and accept the results which are generated by Flex. Some parser generating languages which support Flex are YACC and Bison (see [9], [5]). Bison is an upgraded version of YACC; they have almost identical syntax and almost identical functionality. However, Bison is newer and provides more features than YACC. The prim ary goal which I tried to establish is finding the communication methods and algorithms between Bison and Flex. Regretfully, as I gained more knowledge about the algorithms which establish the communication between Bison and Flex; it became very clear to me th at I must re-implement my entire external lexical analyzer. In other words, I have to re-implement my external lexical analyzer and change the algorithms which I used. As I implemented the parsers, I suddenly realized th a t Bison does not support the grammar which is provided by A l d o r official site. See Section 4.6 for details. Eventually, I was able to make Bison to accept A ld o r gram m ar and started 55 the implementation of an A l d o r parser. When I implemented my external lexical analyzer in Flex, I had not prepared it to work for a parser. Therefore, I re­ implemented the Flex part of the program to create a lexical analyzer th a t generates the information which is required by a B iso n parser. After many different approaches and comparisons, I combined some algorithms and implemented my ALDOR parsers. There are two major versions of parsers which I constructed. A tree structure algorithm parser generates a parse tree th a t contains all parsing information. The second parser used an update “on the fly” algorithm. This parser generates results much faster. However, the information which can be retrieved from an update “on the fly” algorithm parser is very limited. For instance, such parser does not generate any knowledge about a programming language struc­ ture. Thus, it is very hard for one to use it to implement an indentation feature. An update “on the fly” parser processes information directly to a buffer without knowing any details of parsing structure. It just follows the grammar and returns parsing scopes once a scope can be identified. Finally, after some discussion with my supervisor, and survey with other pro­ grammers on their preferences; I decided to implement a tree structure parser. Al­ though it runs slower than an update “on the fly” parser; it is able to provide more information and help for programmers. Nevertheless, I would like to discuss the concept of the update “on the fly” algo­ rithm as it will be easier to discuss the difference between these two very dissimilar parsers and their algorithms. The update “on the fly” algorithm is an one-pass parsing process. B iso n generates a tree bottom up so it is not as easy to create an one-pass parse tree. This version of parser returns results as it move from one grammar rule to another grammar rule. As long as the parser can parse its source file and succeed with some grammar, then it will trigger its functions to return re­ sults. On the other hand, a parser which is implemented in this algorithm is simpler 56 compared to other parsing algorithms (not to mention the more complicated tree structure algorithm). For an update “on the fly” parser to work, there are some required variables. Similar to an external ALDOR lexical analyzer, a position variable is needed to store token positions. However, the concepts and algorithm of the position variable is a little bit different for this parser. From the view of bottom level, each token which is generated by the lexical analyzer is a term inating node for this parser. On the other hand, there are many non-terminal nodes. The basic concept for this algorithm is for a parent scope to fetch the starting point of its first child and the end location of its last child. There are a few possible cases. Firstly, if it found a token, then it is in a term inating rule. For this reason, it does not have any children; thus, the starting and ending point of its children are the same as its own starting and ending point. Secondly, if it is a non-terminal rule, and it has only one child, then the start and end point for this particular non-terminal rule will also be exactly the same as its child. Finally, in a rule which has more than one child, then the starting point for this scope will be the start point of its first child; and the end position will be the end position of its last child. Additionally, the start and end point are inherited from the bottom level. The key parts of the update “on the fly” parser was implemented in two flies called aldor_uof_parser. le x and aldor_uof_parser.y. The token communica­ tion mechanism between these two parts is internal. Since Flex and Bison made a very good parser package. Flex is able to generate the information which Bi­ so n needs. After so m e m o d ific a tio n s a n d co d e im p le m e n ta tio n , I am able to make aldor_uof_parser. le x application generates tokens and pass require information to aldor_uof_parser .y applications. Following th at, aldor_uof_parser .y gener­ ates some Emacs-LISP code and output results to an intermediate buffer. Finally, 57 in t in it_ p = 1; /* Function p r in ts the s ta r t p o in t, end point and the type of a token */ void s e t _ l i s p ( int& s t a r t , in t leng ) { y y l l o c . first_colum n = s ta r t; y y l l o c . f i r s t _ l i n e = line_num; y y l l o c . last_column = s t a r t + leng; y y l l o c . l a s t _ l i n e = line_num; s ta r t += leng; } Figure 4.1: Function th at Prints the Start and End Point I had E macs fetch the E macs -LISP code from an intermediate buffer, then execute and process all information in the intermediate buffer. Please refer to Appendix F for the UML diagram of the parser. The language used in an interm ediate buffer is EMACS-LISP. In other words, Bison returns a com plete EMACS program to an interm ediate buffer. For th is reason, E macs only needs to wait for th e com pletion signal from an external parser, and th en execute th e program which is in an interm ediate buffer. Figure 4.1 shows the implementation of a function which set the start and end position for an update “on the fly” parser. Similar to an external lexical analyzer, the in it_ p variable acts as a position pointer which used to store the position of the current token. The y y l l o c - variables are Flex built-in variables which help Bison to get the start and end position along with the first and last line numbers. As mentioned in Section 4.2, Flex and Bison are a working pair and thus their variables are designed to fit each other perfectly. Figure 4.2 contains some applications of s e t _ l i s p to various cases. My s e t . l i s p function stores information which Bison needs to defines a ter­ minal rule. At the end of each Flex definition, it returns a token type. Therefore, 58 /* examples for some of the tokens */ "mod" { s e t _ l i s p ( in it _ p , y y le n g ); return MOD;} "quo" { s e t _ l i s p ( in it _ p , yyleng); return QUO;} "rem" { s e t _ l i s p ( in it _ p , yyleng); return REM;} /* example for some of the non-constant tokens */ /* import fu nctions */ s td o u t Inewline I space I include I rep Iper IPer { s tr n c p y (to k e n S tr in g ,y y te x t,2 0 ); s e t _ l i s p ( in it _ p , yyleng); return ID; } /* example fo r some of the operators */ /* Other ops - temporary colour as future ops*/ { s e t _ l i s p ( in it _ p , yyleng); return DCOLON;} { s e t _ l i s p ( in it _ p , yyleng); return DOT;} ":*" { s e t _ l i s p ( in it _ p , yyleng); return COLON_STAR;} II J It { s e t _ l i s p ( in it _ p , yyleng); return PLUSMINUS;} Figure 4.2: Example of Token Handling 59 Bison is able to term inate in a terminal rule and finish parsing. Figure 4.3 has some examples of Bison codes which interact with the information returned by Flex. A p rin tT re e function prints out a message which can be executed by EMACS. @ $.first_colum n and 0 $ .last_ co lu m n are Bison variables which find th e s ta rt and end position of its child or children. The implementation for a tree structure parser is completely different. There are many different files involved in order to construct the algorithm. The re­ quired files are as follows: g lo b a l.h , u t i l . h , u t i l . c , a ld o r _ p a r s e r . le x , and a ld o r_ p a rs e r .y. Please refer to Appendix D for instructions about how to find the complete code for these files. In order to build a parse tree, I first categorized the grammar and developed an abstract syntax grammar for ALDOR. I then defined tree nodes so source code can be broken down into many nodes, with each of these nodes having the ability to connect to other nodes. It was a very long task to develop and design such a nodes list. I managed to categorize all nodes into seven major types. Moreover, for each of the major node types, I divided them into several smaller sub-node types. The main challenge was to define exactly the numbers of required node types. If there are too many different type of nodes, a parse tree will become too complicated and the parsing process will be very slow. On the other hand, if there are not enough node kinds, parts of a parse tree will not be connected and the result is an incomplete parse tree. This leads to a node shortage problem. For the ALDOR grammar, the maximum number of children required in a parse tree is three. For example, for the “i f ” statem ent the first child dehnes conditions for the statem ent; the second child defines actions to be taken if the evaluation of conditions turns out to be true; and the third child is an “e l s e ” statem ent, which defines actions to be taken if the evaluation of first argument turn out to be false. Regardless of the number of children, a tree node also needs a sibling link which 60 /* For the code on the parser sid e - Bison code */ /* Here are some examples */ cases : binding_C ollection { prin t_ o verla y ( @$. first_colum n, @$.last_column, " ca ses" ); } casesOpt: binding_C ollection { p rin t_o verlay ( @$. first_colum n , 0 $ .last_column, "casesOpt"); } /* empty */ { prin t_o v erlay ( @$.first_column, 0 $ .last_column, "casesOpt"); } id: ID { prin t_o v erlay ( ®$.first_colum n, @$.last_column, " id " ); } I POUND { p r in t_ o v e r la y ( @$.first_column, @$. last_column, "id " ); } I TILDE { prin t_o v erlay ( ®$. first_colum n, 0 $ .last_column, "id " ); Figure 4.3: Example of Bison Codes for the Update “On The Fly” Algorithm 61 links to the next node; and requires another variable to store such a link. There are differences between a sibling node and a child node. A sibling node lies on the same level as its sibling. On the other hand, a child node is one level beneath its parent node. A child node’s scope is a subset of its parent’s scope; while a sibling node’s scope is independent from other sibling node’s scopes. Keeping this distinction makes indentation possible. As I developed an abstract syntax for ALDOR, I inserted a kind of node called “TempNode” . “TempNode” will not display in a parse tree, nor will it apply an overlay to a source file. In other words, certain nodes generate scopes which I did not find useful to users. For this reason, I replaced these kinds of nodes as “TempNode” . However, my implementation for tree nodes is very flexible and easy to modify. Programmers can replace one or more “TempNodes” by some new type of nodes which they declare. However, they must then insert and modify some files. As a result, new node scopes will be displayed in a parse tree if found, and overlay scopes will be applied to the source code. Similar to the update “on the fly” algorithm, a tree structure algorithm also needs the s e t _ l i s p function to store the start and end locations for all non-terminal and terminal rules (tokens). On the other hand, in Bison, every rule must create its own node according to its type in order to construct a parse tree. Fortunately, I already spent a lot of time studying ALDOR and had resolved the ambiguity of A l d o r s ’ grammar. Hence, I did not have a tough time classifying node types for the rules. As soon as a node is created, variables of the node will store all the information. Finally, I attach a sibling or children to the node if they exist. There are some B iso n variables which I used to build parse trees. The variable $$ represents nodes. Variables $1 and $3 represent th e first and th ird node of their parent node. The “newTempNode(TK)” function and other different kinds of node generator functions are defined in th e files “u t i l .h ” and “u t i l . c” . Figure 4.4 shows 62 infixedE xprs: infixedExpr { $$ = $ 1 ; } I infixedExpr COMMA infixedExprs { $$ = newTempNode(TK); $$ -> s t a r t_ p o s itio n = 0 $ .first_column; $$ ->end_position = 0 $ . last_column; $$ ->token_type = "infixedExprs"; $$ -> ch ild [0] = $1; $$ - > c h i l d [ l ] = $3; } Figure 4.4: Example of Tree Structure in Bison one example of my tree structure parser algorithms. After many re-implementations and corrections, I was finally able to get my parser to parse an ALDOR program properly. The parser program which I designed not only has the ability to parse A l d o r programs, but also presents a syntax error checking feature. For this reason, my parser is able to notify programmers of syntax errors with their location and line number. I did once think of implementing error recovery features; but it does not fit into my research topic. For this reason, error recovery features are left as future improvements. Lastly, there are two versions of output for parsers which use the tree structure algorithm. One of these parser prints results as a syntax tree while the other parser generates E m a cs -L IS P code, so th a t it can communicate with the EMACS text editor. For the parser, EMACS does not use filters or sentinel functions. It simply waits for the signal which is generated by the external parser, then evaluates the inter­ mediate buffer (aldor_parse_m iddle.el) which contains results from my external 63 parser. By evaluating the intermediate buffer (ald or_p a rse_m id d le. e l ) , EMACS inserts overlay properties on the source buffer according to information which is generated by the parser. The concept of an overlay is different from th a t of text properties. Overlay properties belong to a region of a buffer while text properties belong to a text string. For this reason, if one moves a text token, EMACS also moves the text properties associated with the text token; however, the same action will not affect an overlay. Additionally, an overlay region expands as soon as the programmers insert more text into the region. In conclusion, both versions of parser provides parsing information. However, each parser has its own advantages and disadvantages. An “update on the ffy” parser returns the result faster than a tree structure parser. On the other hand, a tree structure parser provides more information to the user than an “update on the ffy” parser. Therefore, although both parsers serve the same purpose, parse the source program; but they can be apply differently according to users’ needs. 4.3 C om m unication The method which I used to establish communication between an external parser and E m acs is different from the method which I used for my external lexical analyzer and E m a c s . The major difference between these two methods is how they handle the intermediate buffer. The external ALDOR lexical analyzer produced semi-EMACS-LISP code. As a result, E m acs m ust read results one by one from th e interm ediate buffer and process each of these results individually. T he external parser, however, generated pureE m a cs -L IS P code. For this reason, EMACS can evaluate th e entire interm ediate buffer all at once. Moreover, communication with my parser program is three-way communication. 64 As mentioned in Section 4.2, a Bison parser program acts as an intermediate pro­ gram to bridge communication between a lexical analyzer and EMACS. One of the main focuses of this parser exercise is to learn communication methods which involve more than two programs. The communications between a lexical analyzer and a parser program is as fol­ lows. A Flex lexical analyzer identifies tokens and returns them one by one. The Bison parser, requests tokens one by one from the Flex lexical analyzer. This process continues until every token in a source buffer has been scanned. Communi­ cation between Bison and Flex is a series of function calls. Bison starts a loop and terminates only if all t e x t in the source buffer has been processed, the parser has parsed the initial goal, and a tree has been successfully built. During the procedure of Bison creating a tree, it calls the Flex program repeatedly until termination. The Flex lexical analyzer returns one token at a time to the Bison parser and stays at the beginning of the next token, ready for the next request from Bison. How­ ever, if the lexical analyzer reaches the end of the buffer, it continually returns an end-of-file token. Finally, when Bison finished, it generates the final results in an intermediate buffer if the source buffer can be parsed correctly; or it returns syntax error messages to notify users th at the parsing process failed. The outcome of this exercise is acceptable and the greatest shortcoming was a lack of speed. As I used an external lexical analyzer and an external parser, it is not unreasonable for my program to have a long processing or parsing time. Even though I tried some other methods, the main predicament is still related to communication. As the size of the programs grow, communication between these programs will increase as well. For this reason, it will take longer for a driver program to receive results from these external programs. In conclusion, there is a chain reaction in the communication network. As each program is delayed for a short period of time, the whole package ends up becoming 65 inefficient. For this reason, it is very im portant to design a good algorithm for pro­ grams to communicate with each other without waiting for each other. Nevertheless, some waiting is inevitable. 4.4 Efficiency The overall efficiency for my external parser is fair. The external parser program has acceptable memory resource management results; however, the time resource management is not as good. Firstly, space efficiency had been done well for the update “on the fly” parser. For this version of the parser, the only extra memory storage required is an intermediate buffer. On the other hand, a tree structure parser requires a lot of extra space to store its nodes, node types, and all the information which is associated with the nodes. For this reason, the tree structure parser has the worst space efficiency over the entire collection of parsers which I designed. Secondly, the time efficiency for update “on the fly” parser is good. Since this algorithm updates an intermediate buffer almost instantly; an update “on the fly” parser has the best time management among the parsers which I developed. How­ ever, this parser program still has to wait for its lexical analyzer program to finish classifying tokens. The tree structure parser is even slower because it has to con­ struct a tree and assign value to its variables. For instance, it take about one minutes to parse 1400 lines of A ldor code. Hence, I do not claim th at the time efficiency is good for any of the external parsers. Figure 4.5 shows the timing data for my tree structure parser. Finally, the usefulness of result are very clear. An update “on the fly” parser only provides basic overlay information to E m a c s , making the usefulness for an update “on the fly” parser poor. On the other hand, the usefulness for a tree structure 66 Run time comparison 450 400 IT 350 1 300 E 250 Parser :2o, I Poly. (Parser) 150 3 100 K 0 100 200 300 # of co p ies of source co d e (1 copy = 23 lines of code) Figure 4.5: Timing D ata for Tree Structure Parser parser is acceptable. According to my research, a tree structure parser provides the maximum amount of information compared to any parser algorithm. In other words, there is no more information which a parser could supply other than returning a complete tree structure with information on each node. In conclusion, time efficiency is the most critical drawback for a tree structure parser. On the other hand, a tree structure parser has the ability to provide maxi­ mum information to its requester. Therefore, it is very hard for one to judge which of these algorithms is the better approach. An update “on the fly” parser will satisfy the users who look for speed and some basic parsing information. Conversely, most of the programmers may care about functionality, as long as the running time is reasonable. As a result, I implemented the parsing process as an on-call function, instead of an auto-load function. Consequently, programmers can parse any ALDOR 67 program through A l d o r m ode in EMACS at any tim e they wish. 4.5 C orrectness The parsers correctly identify the start and end position of each rule block. Thus, E macs is able to apply overlay regions correctly to the source buffer. Additionally, a tree structure parser has the ability to print out an entire parse tree if there is no syntax error in the program. Otherwise, it will print the location and the line number of the syntax error. Moreover, I reduced the rules and developed an abstract syntax grammar for the A ldor programming language. As a result, I successfully cut the node count from three thousand four hundred to ninety nine (for p a r s e r _ s im p le . as) without affecting the tree structure. The abstract syntax grammar is a grammar which is developed from the concrete syntax. It has fewer rules but is still able to parse a program properly. I also successfully reduced overlay insertion commands in an intermediate buffer by removing any unneeded nodes in Bison. In conclusion, there is no possibility for a Bison generated parser to go wrong when all of information produced by the F lex lexical analyzer is correct. Some possible errors may be caused by not correctly setting the token position or by not properly setting programming language grammar rules. Since I already developed a tree structure parser, other programmers can just use my programs as a tem plate and modify the grammar rules. 4.6 P roblem s E ncountered and their Solutions The A ldor team has their own parsing process. They pre-process the parame­ terized grammar by another program to generates non-parameterized rules before 68 1. Create some non-terminal rules for each level of precedence. 2. Isolate the corresponding parts of the grammar. 3. Force the parser to recognize higher precedence sub expressions first. 4. Use some ad-hoc rules in the parser (match up e ls e with the closest if ) . 5. Change the abstract grammar as required. Use temporary nodes and the like. Figure 4.6: Rules to Resolve Ambiguous Grammars passing the processed grammar to a parsing tool such as YACC or Bison. For this reason, Bison thought the parameterized grammar from ÂLDOR official site is am­ biguous. For example, the BindL(R,L) in original A l d o r grammar requires some reworking in order for Bison to accept the rule. There is a lot of information which teaches programmers to resolve a set of ambiguous grammar. Figure 4.6 contains some of key points which I used to work on the ALDOR grammar rules. After I fixed the grammar, there were still many reduce-reduce and reduceshift problems which existed. A reduce-reduce conflict occurs if there are two or more rules th a t apply to the same sequence of input. The most famous example of a shift-reduce conflict is the if- th e n - e ls e statem ent. Consider an if - th e n - e ls e statem ent th a t has two i f and one e ls e statements. The problem is to define which if-statem ent the else-statem ent should bind to. Both reduce-reduce and shift-reduce conflicts are ambiguity errors. As mentioned in Section 4.2, B ison has the ability to recognize the ambiguities of a grammar. Hence, with help from Bison, I eventually developed a set of unambiguous A l d o r grammar rules. As I worked on the grammar, I also noticed th a t some of the rules could never be reached. One of the reasons is th at some tokens required by these rule cannot be generated. If a token is not in the lexical analyzer rules, then the token will not be generated. Thus, rules which contain these tokens will never be reached. Since these rules are unreachable by any grammar, they will not affect the correctness of my 69 curlyContentB_Labelled: preDocument la b e lle d postDocument { $$ = newBlockNode(CurlylK); $ $ -> sib lin g = $1; $ l- > s ib lin g = $2; $ 2 -> sib lin g = $3; > Figure 4.7: A Tree Structure with a NULL Problem parser. As a result, I simply ignored those rules. The original grammar for ALDOR is found at A l d o r ’s official website and a new A l d o r grammar which I developed is found in Appendix B. Additionally, the abstract syntax which I developed is found in Appendix C. Among all the problems I encountered while developed this parser, the sibling problem was the most troublesome issue. The sibling problem actually took me a very long time to figure out, and it has a few variations. Firstly, there are chances for a node in a programming tree to be NULL. If a node has not been created and has thus a value of NULL, then it will be a big problem when the parser tries to assign a sibling to it. Figure 4.7 is an example of this situation. The first time I encountered this problem was with the Post-Document, PreDocument, and Comment rules. It is very common for this situation to occur. Since not every programmers writes pre-documents, the Pre-Document token may not exist. In this case the Pre-Document token node will be a null node. Therefore, when a parser tries to assign a sibling to it, the parser will crash. For this reason, I created some generic non-terminal rules to make everything work. I also checked every rule which has the possibility for this problem to occur. Figure 4.8 shows actual code of my solutions for this problem. The second variation of a sibling problem is the over-written sibling problem. I did not notice this problem until I printed my syntax trees. I spent a long time 70 curlyContentB_Labelled: preDocument la b e lle d postDocument { i f ( $1 == NULL && $2 == NULL ) $$ = $3; e l s e i f ($1 == NULL && $3 == NULL) $$ = $2 ; e l s e i f ($2 == NULL && $3 == NULL) $$=$!; $$ = newBlockNode(CurlylK); $$-> ch ild[0] = $1; $ $ -> c h ild [l] = $2; $$-> ch ild[2] = $3; } Figure 4.8: Solution for the Tree Structure with NULL Problem identifying this problem and creating a solution for it. This problem also occurs quite frequently. This problem happens when a program assigns a child node as the node itself. Unfortunately, if the child node has a sibling, it will over-write the sibling of its parent node. Consider Figure 4.9 as an example. In this example, ParentNode has the body of FirstNode and is a sibling of SecondNode. Additionally, FirstNode (body of Parent) can be derived as the ChildNode with the sibling of C hildSibling. Since Bison is a bottom up parser which builds the parse tree from terminal nodes (tokens) to parent nodes. SecondNode will be assigned as the sibling of FirstNode before FirstNode has been evaluated. W hen FirstNode is later evaluated, it is assigned to FirstNode ($$ = $1); which causes SecondNode being over-written by the C hildSibling. I solved this problem by add another level of non-terminal nodes. the example in Figure 4.10. Consider FirstNode and SecondNode are both children of 71 Parent : FirstNode SEMI SecondNode { $$=$!; $ $ -> sib lin g = $3; } FirstNode ChildNode C h ild S ib lin g { $$ = $ 1 ; $ $ -> sib lin g = $2; } Figure 4.9: A Tree Structure with a Sibling Problem Parent : FirstNode SEMI SecondNode { $$ = newTempNode(TK); $$-> ch ild [0] = $1; $ $ -> c h ild [l] = $3; } FirstNode ChildNode C h ild S ib lin g { $$ = $ 1 ; $ $ -> sib lin g = $2; > Figure 4.10: Solution for the Tree Structure with Sibling Problem 72 ParentNode. Therefore, the order of these two nodes will not get mixed up. Ad­ ditionally, since every node has its own branch, an over-write problem does not happen. Similar to the FirstNode situation; one can always add in a level of nodes to prevent an over-written sibling problem. In addition, there is a node shortage problem. If there are not enough types of nodes to connect a parse tree, then the tree will not parse properly. As mentioned in Section 4.2, I tried to develop an abstract syntax grammar. A node shortage problem was caused by deleting the connection nodes from the parser. There are references on-line th a t talk about how to develop abstract grammars from concrete syntax grammars. It is worth the time and effort to create the abstract syntax grammar to improve parser efficiency. Lastly, there are communications problems between Bison and Flex; Bison and the utility files; and Bison and E macs . However, the major problems are improper buffer settings and buffer waiting problems (see Section 3.6 on page 44). In conclusion, the major problems for my parser are ALDOR grammar rules and the implementation logic of Bison. Additionally, communication between Bison and other programs can be confusing at times. 4.7 Future W ork One of the major improvements which to add to the parser is an autom atic re-parse system. I consider such as a future improvement, since it does not directly relate to my current research topic. The major concepts for this re-parse system are as follows: (1) when to re-parse and (2) what to re-parse. A parser should re-parse its source buffer once the buffer had been modified. However, I do not wish to re-parse the source buffer very often, since the parsing process is very time costly. Therefore, I propose th a t a parser only re-parse its source buffer if a source buffer 73 is idle for more than a constant amount of time. On the other hand, what should be re-parsed? Re-parsing a region of a source buffer would definitely be faster and more efficient than re-parsing the entire source buffer. Nevertheless, if I wish to re-parse part of source buffer, I have to find an appropriate parsing scope first. In some cases, modification of the buffer may lead to a completely different parse tree. Therefore, how far (on the scope levels) should a parser trace back? Determining this will be major challenge for a re-parsing process. In conclusion, the re-parsing process is not simple, and it requires more research. For this reason, I left the automatically re-parsing process as a future improvement. At this point of time, users must re-parse their source buffers manually. 4.8 C onclusion and Sum m ary In conclusion, a tree structure parser is slower but more powerful than an update “on the fly” parser. The issue is the substance of what users want and what they need. For some people, a tree structure parser may be more than what they need. On the other hand, for some people, an update “on the fly” parser may not provide enough information. The main learning and research foci of external parsers are the algorithms for parsers and communication between multiple programs. One of the greatest advan­ tages for me is th a t I had already written an external lexical analyzer. As a result, when I performed research related to communication on multiple programs, the re­ search results I found made more sense to me. Moreover, my experience helped me in the debugging process and in finding find problems. It also allowed me to figure out possible communication mistakes faster. Additionally, if I had not completed the external lexical analyzers, I would not have been able to finish these parser. I was able to learn the methods of multi-program communication, and to perform 74 several different kinds of parsing techniques. Examination of a few different kinds of parser algorithms made me think very carefully on the balance of a program. I learned th a t to implement some features, other features may have to be sacrificed. Overall, there is really no guideline or standard for a good program. Different people have different demands for a given program. It all depends on what a user needs. In this chapter, an external A l d o r parser has been discussed. In the next chapter, an ALDOR mode for EMACS will be introduced. An ALDOR mode puts lexical analyzers and parser into a package, and provides other tools. 75 Chapter 5 Aldor M ode There are already many different programming language modes which exist in E m a c s . However, none of these programming modes are implemented outside of E m a c s . Half of the functions I used to create my ALDOR mode for E m acs are implemented and executed outside EMACS. For this reason, the ALDOR mode which I present contains work which I believe has not been done previously. The aldor-m ode is the final parts of this thesis. An ALDOR programming mode for EMACS is a big package which collects together all functions which are related to A l d o r . The aldor-m ode for EMACS wraps up the work I have completed into a single package. Moreover, there is no previously existing text editor designed to support A l d o r . Hence, for A l d o r programmers, E m a c s may eventually become one of the most popular text editors th a t they use. Nevertheless, there is still a lot of research left to add program m ing modes to E m a c s . For instance, there are still m any tools which can be im plem ented to support program m ers. The whole idea of these “helping” features and tools in a te x t editor is to assist a program m er to code faster and more efficiently. 76 5.1 P rogress and M eth od s Although this component comes last within the thesis; research on language modes for E macs occurred throughout my research. Among the entire collection of modes which E m acs provides, cc-mode is the most complicated programming mode in E m a c s . For this reason, if one can understand the concept of cc-mode in EMACS, then they will be able to institute a very powerful A l d o r mode for EMACS. Eventually, I abandoned the thought of implementing an ald o r-m o d e similar to cc-mode. E m acs cc-mode is too complicated for an entry level E m a c s -L IS P pro­ grammer to understand. There are a few reasons which may explain why cc-mode is so complicated. Since cc-mode deals with “on the fly” updates to syntax analy­ sis, the implementation for such features is very complex. Moreover, the language analysis occurs inside the package, which make it very hard to separate the analysis from the editing commands. However, the main concentration of this part of my research was on communication with external programs. After both versions of the A l d o r lexical analyzers were completed, the research was mainly focused on the easiest and shortest programming mode — the P a s c a l programming mode. Unfortunately, it is not easy to write a new programming mode for E m acs just by simply studying programming language modes. One must also use the E m a cs -L IS P manual which presents a lot of information related to programming language modes and methods for adding a new programming language mode to the EMACS text editor. Although there is a lot of information on-line which discusses how to add a mode to E m a c s , most of these links redirected me to the GNU EMACS manual site or simply displayed partial contents of the E m a cs -L IS P manual. My A l d o r mode provides some basic features, such as syntax-based colouring and parsing. On the other hand, the cc-mode in EMACS is more powerful and able to give more support to programmers. For example, the cc-mode updates and re77 C olours A l d o r Types E m a g s -L IS P - Colour R e se r v e d D e f in a b l e K e y w o r d C lass I m p o r t F u n c t io n P r e-D ocum ent P ost-D ocum ent C om m ent S t r in g I d e n t if ie r F loat In teg er D e f in a b l e O p e r a t o r R e se r v e d O p e r a t o r F u ture O pera to r ’font-lock-keyword-face ’font-lock-keyword-face ’font-lock-type-face ’font- lo ck- funct ion- name-face ’font-lock-doc-face ’font-lock-doc-face ’font-lock-comment-face ’font-lock-string-face ’font- lo ck-variable- name- face ’font-lock-constant-face ’font-lock-constant-face ’font-lock-builtin-face ’font-lock-builtin-face ’font-lock-builtin-face Table 5.1; The Colour Association Used by the A l d o r Mode colours a token almost instantly. However, my A l d o r mode is expandable, more generic, and able to accept external resources. Hence it may be modified to create another programming language mode. Meanwhile, the cc-mode for E m acs is almost impossible to modify to suit another programming language. 5.2 Syntax-based C olouring Among all programming text editors, syntax-based colouring is the most basic of requirements. Programming text editors should have the ability to identify different types of tokens and colour them with appropriate colours. Research on programming text editors shows th a t it is always a very good practice to follow the habits of the majority. In the same way, I intended to keep the syntax-based colour highlighting consistent within EMACS. Table 5.1 lists the colours which I associate with ALDOR token types. One of the greatest advantages of this approach is to ensure the consistency of 78 the font colours. The fo n t-lo c k -fa c e variables can be changed in the customize pages in E m ACS. If programmers decide to change the colours which represents functions on their machines, they will end up changing the whole colour scheme. In other words, the colours will be kept consistent for different programming languages. The concepts used to colour the token strings are very simple. All it takes is for a lexical analyzer to classify the token types. An internal lexical analyzer calls functions to put all the desired text-properties on token strings. An external lexical analyzer generates an intermediate buffer which provides all the information for E macs to do its job; then EMACS calls the syntax-based colouring functions to add selected text-properties to text strings. In conclusion, the colour syntax process for a programming language only re­ quires a lexical analyzer for th a t particular programming language. As soon as the token can be identified, and token strings can be located, then everything will be completed quickly. Nevertheless, th a t is only an approximation of the truth. For instance, in C"'"'', there is a variable c++-f o n t-lo c k -e x tr a -ty p e s th at the programmer can set so th at classes like s tr in g , l i s t , and s iz e _ t appear in fo n t-lo c k -ty p e -fa c e rather than fo n t-lo ck -v a ria b len a m e-fa ce. The need for this variable results from the fact th a t cc-mode does not know how to parse header files. Similarly, in aldor-mode, it would be nice to put identifiers after in fo n t-lo c k -ty p e -fa c e , but this requires parsing. 5.3 U p d atin g A lgorithm s for th e A ldor M ode An ordinary programming text editor should also provide real time updating fea­ tures. For instance, as a programmer enters code, the editor should be able to update results immediately. Thus, a programmer will be able to tell if any syntax errors have occurred, and be able to see the modifications they have made. 79 Furthermore, a more powerful programming text editor provides features for updating the source buffer according to any of the insertions, deletions, or modihcations applied by the programmers. When should text be updated? There are many different update algorithms. One algorithm suggests the newly modified text should be updated within an appropriate amount of time while performing a minimal num­ ber of updates. In addition, this algorithm suggests updating and refreshing the screen at constant time intervals. However, this will not work well with a huge buffer. Since the buffer contains so much information, such an algorithm is very inefficient and inappropriate for handling the buffer update. Therefore, I did some minor modihcation, to make a very powerful updating algorithm. The algorithm still refreshes the screen at constant time intervals. However, it now only updates the text field which has been modified or has not yet been coloured. My approach was to update the region of text which is located two lines above and two lines below the current position. This event is triggered when the key “space” or “newline” is pressed. The algorithm assumes th at the key press of “space” or “newline” is a signal of reaching the end of the token characters. There is a critical problem associated with this algorithm. If a programmer types a set of characters, and then moves the cursor and clicks the mouse at another location and starts typing again, the algorithm will not work properly since none of the end token characters has been entered. The solution which I suggest to solve this problem is to remember the minimum and maximum location of the programmer’s cursor. In this case, the programmer can click anywhere they wish to, and can modify whatever information they please. As soon as they press any of the end token characters, all texts which do not have properties will be updated all at once. There are other text updating algorithms. For example, some text editors try to classify the entire text string which is currently being modified by a programmer. 80 For this reason, the text token is always up to date. As soon as there is any change in the text, it will trigger a lexical analyzer to classify the text token. 5.4 Features There are some features which I provide for A l d o r m ode in E m a c s . The most basic feature which I provide is syntax-based colour highlighting. Syntax-based colouring is present once a programmer opens a file whose name ends with “ . a s ” (the A l d o r format). In other words, as soon as the programmer opens an ALDOR file the whole buffer is coloured. Additionally, when programmers hover their mouse on top of a token string, the echo area displays the token type for the token. Therefore, the programmer is able to identify the type of any tokens in the buffer. Furthermore, I implemented a “print parse tree” feature for A l d o r mode in E m a c s . This feature asks a programmer to select an A ld o r file. Once a program­ mer chooses a file, the EMACS text editor pops open a new frame and displays the tree in th a t new frame. This feature allows programmers to understand how an A l d o r program has been parsed and to see the parse tree form of the program. The shortcut key for the A l d o r tree printing feature is set as (M-RET). Moreover, as mentioned in Chapter 4, there is a “parsing” feature for program­ mers who wish to parse their ALDOR program and get extra information. At this stage, the parser only inserts overlays to ALDOR program buffers. However, it has the ability to be modified to have more powerful features. The shortcut key for the parsing feature is (M-p). The last feature th a t I provide for A l d o r programmers is a comment region feature. For instance, users can highlight any region in a buffer, to comment it out. This feature is associate with the shortcut key (M-c). In this way, the users will be able to comment out any section of code they wish to leave out. However, if a user 81 plans to re-implement the comment function, newcomment. e l code in EMACS will be a good reference to study. It provides more general functions to do the same. In summary, my ALDOR mode is similar to other existing E m acs language modes in th at it provides the syntax-based colouring of tokens and allows the programmer to comment out a region. However, to print a parse tree of a program and to echo the token type selected by the cursor are new features. 5.5 Efficiency The A l d o r mode for E m a c s executes very efficiently. As my lexical analyzer was im plem ented carefully; th e tim e and space efficiency is very good. Therefore, the overall efficiency for b o th my internal and external versions in th e aldor-mode is acceptable. The parser does not have good time efficiency. However, my parser is still able to return results within a reasonable amount of time. As for space efficiency, a tree node parser takes more space than an update “on the fly” parser. However, the intermediate buffers for both parsers are as big as a tree, since they have to store all nodes. For this reason, the space efficiency for my parsers are not very good. The A l d o r print tree feature is similar to a parsing feature. In order for a parser to draw a complete A l d o r program tree, it requires full parsing of the source buffer. Hence, the time efficiency is similar to the A l d o r parser. However, results of tree printing do help many programmers. Finally, the “aldor-com m ent-region” feature is very efficient. It runs relatively fast and it does not take up any memory space. Moreover, this feature can help A l d o r programmers in many different ways. 82 5.6 P roblem s E ncountered and their Solutions The major challenge was to learn methods which allow users to add a new program­ ming language mode into E m a c s . The E m acs syntax table for a mode is difficult to understand. However, after I did research on E m acs language modes and I received help from my supervisor, I was able to resolve the problems I encountered. The insertion of hooks was very confusing. However, after studying the manual and doing some experiments on hooks, it seemed to be much simpler and easier to understand. In conclusion, to put all of the components together and create an A l d o r mode for E m acs text editor is not really a very complicated job. The E m a cs -L IS P manual is very well documented. For th a t reason, almost all of the problems can be solved by reading through the manual then performing some ex­ periments. 5.7 C onclusion and Sum m ary The ald o r-m o d e was one of the easier exercises in my research. The m ain focus of this portion was to merge all of th e com ponents together and to add th e ald o r-m o d e into E m a c s . In conclusion, ald o r-m o d e for EMACS connects the work I completed. Program­ mers are able to see each of the components. Furthermore, it gives people a chance to think through the purpose of each component in the ALDOR mode. My research process was very long and I waited a long time to put all of the pieces together. My work will help others to gain knowledge about programming language modes. E v e n tu a lly , th e y to o w ill b e a b le to a d d a n ew la n g u a g e m o d e to E m a c s . The A l d o r mode packages together the A l d o r lexical analyzers, parser, and other tools. It integrates the programming th a t I did. In the next chapter I sum­ marize the work done in the entire thesis. 83 Chapter 6 Conclusions and Summary I implemented an internal lexical analyzer, an external lexical analyzer, two versions of a parser, and bound them together with an A l d o r mode. In this chapter, I briefly review the whole thesis and state my conclusions. An internal ALDOR lexical analyzer was implemented in E m acs-L IS P . This version of lexical analyzer runs relative quickly and is able to update a buffer almost instantly. However, the codes for an internal ALDOR lexical analyzer is not re-usable for other programming languages. An external ALDOR lexical analyzer was implemented in F l e x and C. The main challenge of implementing this version of the lexical analyzer was communication. The E m acs text editor and my external lexical analyzer communicate through pipes and an intermediate buffer. Although, an external lexical analyzer does not return results as fast as an internal lexical analyzer, the code can be modified and re-used for other programming languages. The lexical analyzers put text properties into the buffer, and perform a syntaxbased colouring feature. Moreover, I implemented an external lexical analyzer to provide all information required by my external ALDOR parser. T he A l d o r parsers were im plem ented in B iso n and F l e x . My external A l - 84 DOR parsers create overlays and can be used to implement many powerful features. Additionally, a tree node parser generates parse trees for A l d o r programs with node information and binding. However, my parsers will only support full buffer parsing. Currently the parsers do not support partial parsing. 6.1 Sum m ary o f m y R esearch T im eline Internal lexical analyzers were the starting point of my research process. During the development time for my internal ALDOR lexical analyzers, I learned a lot about E m a c s . Since I was working on both versions of the lexical analyzers, I was able to compare the two versions side by side. However, most of the scanning algorithms for external lexical analyzers could not be applied to internal lexical analyzers. By the time I completed part of my internal lexical analyzer, I had gained enough knowledge to implement the external lexical analyzer for E m a c s . Nevertheless, the main challenge was the communication problems between programs. At this stage, I took a lot of time to understand and learn the concepts for communication between programs. After I completed both versions of the lexical analyzer, I started to work on my parser. The research I did gave me several different choices of algorithms to parse a program and create a parsing tree. Fortunately, I had done the external lexical analyzer already. Thus, I did not have a very hard time debugging communication problems between B iso n and EMACS. Nevertheless, I encountered communication problems between F le x and B ison. Additionally, the A l d o r grammar rules which I had in the beginning were ambiguous to B ison. For this reason, I had to try to resolve the grammar which B iso n determined as ambiguous before being able to continue in my work. I managed to get it done before I finished my ALDOR mode for EMACS. For this reason, I am able to put the parsing features and print tree 85 features into the aldor-m ode. Lastly, I wrapped up my entire work with an ALDOR mode for EMACS. T h e development process for aldor-m ode was not too rough. As a result, I had enough time to add in some extra features such as the aldor-com m ent-region. Parse this buffer and a l d o r - p r i n t - t r e e functions. 6.2 W hat can P eop le Learn from m y Work? As I have finished this research on E m ACS, people can now learn from the examples which I provide, and can learn how to add a new language mode to E m a c s . At least, they will not need to dig into the details about EMACS and E m a cs -L IS P . My goal is to develop a system not only for A l d o r , but also for other languages. For instance, other programmers should be able to add a new programming language mode once they know the grammar and tokenizing rules for a programming language. They can use my work as a tem plate to develop a new language mode in E m ACS. People can also learn about how to separate syntax analysis from editing. I have also shown th a t it is possible for EMACS to communicate with external resources and process the returned information. 6.3 C om parison o f Lexical A nalyzers It is very interesting to compare the internal and the external versions of the lexical analyzers. The results of my experiments match with what I had originally thought. Before I worked on this thesis, I suspected th a t an external approach would work slower but better than the internal version. As the results show, my internal lex­ ical analyzer works faster than the external lexical analyzer. The internal lexical analyzer has speed and space advantages. However, the external version can work concurrently via pipes, rather than in a batch mode via files. Additionally, the ex­ 86 ternal lexical analyzer is expandable and re-usable. The major factor which slows down the external version is program waiting time. As with what I learned from my parser, sometimes a program has to sacrifice its speed in exchange for more utility. Nonetheless, an external lexical analyzer is more useful than an internal version. Whereas, the internal version only works inside E macs , an external version has the ability to output in any format and provide input to programs other than E macs . For this reason, an external lexical analyzer has more flexibility than the internal version. Therefore, the experimental external lexical analyzer turned out to be very acceptable. Basically, it has very reasonable time and memory efficiency. The timing of the lexical analyzers has been measured and plotted, and is displayed in Chapter 3 of this thesis. In conclusion, internal lexical analyzers process faster than external lexical analyzers but an external lexical analyzer is more useful than an internal version. 6.4 Further Im provem ent and Future W ork At the current stage, I established basic methods for interaction between EMACS and external programs. However, the aldor-mode in EMACS can still improve a lot. For example, the parser provides a lot of information which can be used to improve both the performance and features of the aldor-mode. The update scheme for both lexical analyzer and parser can be improved as well. For instance, have Emacs update the modified buffer after a constant amount of time. Moreover, have the parser reparse a section of buffer instead of the whole buffer. One could also implement an autom atic reparse feature for the mode and an automatic syntax recovery or suggesting system. As for future work, I am planning to develop and experiment with some new algorithms th a t will enable the computer to better interact with the outside world. 87 Bibliography [1] Nathaniel S. Borenstein and James Gosling. Unix Emacs: a retrospective (lessons for flexible system design). In U IST ’88: Proceedings o f the 1st annual A C M S IG G R A P H sym posium on User Interface Software, pages 95-101, New York, NY, USA, 1988. ACM Press. [2] Peter Broadbery and Manuel Bronstein. A first course on Aldor with libaldor. available from w w w .aldor.org, March 2002. [3] Manuel Bronstein. libaldor User Guide and Reference Manual, available from w w w .aldor.org, March 2002. [4] Debra Cameron, Bill Rosenblatt, and Eric Raymond. Learning G NU Emacs. O’Reilly, 1996. [5] Charles Donnelly and Richard Stallman. Bison. Free Software Foundation, December 2004. [6] V Donzeau-Couge, B. Lang, and B. Mélèse. Practical applications of a syntax directed program manipulation environment. In IG SE ’8f: Proceedings o f the 7th international conference on Software engineering, pages 346-354, Piscataway, NJ, USA, 1984. IEEE Press. [7] Martin N. Dunstan. Aldor compiler internais II; Code generation and optimi­ sation. available from w w w .aldor.org, September 2000. [8] Bryan Ford. Packrat parsing:: simple, powerful, lazy, linear time, functional pearl. In IC F P ’02: Proceedings o f the seventh A C M S IG P L A N international conference on Functional programming, pages 36-47, New York, NY, USA, 2002. ACM Press. [9] Free Software Foundation. B ison 1.875 Manual. Free Software Foundation, November 2002. [10] Free Software Foundation. GNU Em acs Lisp Reference Manual, For Em acs Version 21, Revision 2.8. Free Software Foundation, 2002. [11] Christopher Fry. Programming on an already full brain. Com mun. ACM , 40(4):55-64, 1997. [12] Greg Harvey. W hat is Emacs? availablefrom w w w .gnu.org/softw are/em acs/ em acs.htm l, 2003. [13] Ralf Hemmecke. An Emacs mode for Aldor. available from www. hemmecke. d e / a ld o r/a ld o r-m o d e .p d f, November 2005. [14] Michael D. Hutton. Noncanonical extensions of Ir parsing methods, available from w w w .eecg .to ro n to .ed u /~ m d h u tto n /p ap ers/n o n can .p d f, 1990. [15] A D Kennedy. Semantics of categories in Aldor. available from www.ph.ed. a c .u k /~ b j/p arald o r/W W W /d o cs/d isc u ssio n /d e fin e .p d f , 2001. [16] John Levine, Tony Mason, and Doug Brown. Lex & Yacc. O ’Reilly, 1992. [17] Vern Paxson. Flex Manual, version 2.5. Free Software Foundation, 1995. [18] Erik Poll and Simon Thompson. The Type System of Aldor. Technical Report 11-99, Computing Laboratory, University of Kent at Canterbury, July 1999. [19] Roger C. Schank. Language and memory. Technical report, Yale University, 1990. [20] Richard M. Stallman. G NU Em acs Manual. Free Software Foundation, 2002. [21] Keith Waclean. A tutorial introduction to GNU Emacs. available from h tt p : / / w w w .lib .u ch icag o . e d u /k e i th /tc l - c o u r s e /e m a c s - tu to r ia l.h tm l, 1997. [22] Stephen M. W att. Aldor III. available from w w w .aldor.org, September 2000. [23] Stephen M. W att. Aldor: Implementation I. available from w w w .aldor.org, September 2000. [24] Stephen M. W att. Aldor: Interfaces, available from w w w .aldor.org, September 2000 . [25] Stephen M. W att. Aldor: The language and recent directions, available from w w w .aldor.org, June 2000. [26] Stephen M. W att. Aldor: An introduction to the language, available from w w w .aldor.org, September 2002. [27] Stephen M. W att. Aldor User Guide. The Numerical Algorithms Group Lim­ ited, 2002. 90 [28] LALR parser. http://en.w ikipedia.org/w iki/LA LR_parser. in Wikipedia, the free encyclopedia. [29] Saul Youssef. Prospects for category theory in Aldor. available from h t t p : / / a t l a s .bu. ed u /~ y o u ssef/p a p ers/m a th /a id er/a ld o r. pdf, 2001. 91 A ppendix A A Lexical Structure for Aider A .l Lexical Structure The following information is referenced from the ALDOR User Guide [27]. A , 1.1 C haracters The stan d ard A l d o r character set contains th e following 97 characters: • th e blank, ta b and newline characters • the Roman letters: a-z A-Z • the digits: 0-9 • and th e following special characters: ( [ { < , l e f t p aren th esis l e f t bracket l e f t brace le s s than comma ) ] } > . r ig h t p a ren th esis r ig h t bracket r ig h t brace g rea ter than period 92 / ; sem icolon qu estion mark equals plu s ampersand sla sh apostrophe (quote) double quote circum flex commercial at d o lla r colon exclam ation mark underscore minus (hyphen) a s te r is k b ack -slash grave (back-quote) v e r t ic a l bar tild e sharp percent # 7. Other characters may appear in source programs, but only in comments and string-style literals. Blank, tab and newline are called white space characters. All the special characters except quote, grave and ampersand are required for use in tokens. Grave and ampersand are reserved for future use. A . 1.2 T h e E scap e C haracter Underscore is used as an escape character, which alters the meaning of the following text. The nature of the change depends on the context in which the underscore appears. An escaped underscore is not an escape character. An escape character followed by one or more white space characters causes the white space to be ignored. The remainder of this section assumes th a t escaped white space has been removed from the source. A . 1.3 Tokens T he sequence of source characters is p artitioned into tokens. T he longest possible m atch is always used. The tokens are classihed as follows: 93 • the following language-defined keywords: and catch e ls e for goto in lin e lo c a l or ru le tr y add but do fix generate in le t of return to y ie ld . , 1 [ => ] ; +-> { ; := } always d e fa u lt except f lu id has is macro pretend s e le c t where :: :* ( ) a sse r t d efin e export fr e e if is n t never r ef then w hile $ ) break delay extend from import ite r a te not repeat throw with @ The characters in a keyword cannot be escaped. T hat is, if a character is escaped, the token is not treated as a keyword. • the following are not defined by the language but are reserved words for future use: delay f i x (I I) [l is l] is n t l e t {l 1} ' ru le & II • the following set of definable operators: by case # + * ** / <= >= \ mod - quo rem += /\ « \/ » < > <- -> The characters in an operator cannot be escaped. • identifiers: 0 1 [y.a-zA-Z] [%? ! a-zA-ZO-9] * 94 Any non-white space standard character may be included in an identifier by escaping it. Thus “a”, "a_*" and “_ i f ” are all identifiers. The escape character is not part of the identifier so “ab” “_a_b” represent the same iden­ tifier. Identifiers are the only tokens for which the leading character may be escaped. • string-style literals: ( ti ; ^ ( ti ) An underscore or double quote may be included in a string-style literal by escaping it. • integer-style literals: [2-9] [0-9] [0-9] + [0-9] + 'r> [0-9A-Z] + Escape characters are ignored in integer-style literals and so may be used to group digits. • Floating point-style literals: [ 0 - 9 ] [ 0 - 9 ] + { [eE]{ [+ -]> [0 -9 ]+ } [0-9] + '. ’ [0-9] [eE] { [+ -] > [0-9] +} [0-9] + [eE] { [+ -] } [0-9] + [0-9] + ' r ' [ 0 - 9 A - Z ] [ 0 - 9 A - Z ] + { e { [+-] }[0 -9 ]+ } [0 -9 ]+ ‘r ’ [ 0 - 9 A - Z ] [ 0 - 9 A - Z ] * { e { [+ -]} [0 -9 ]+ } [0-9] + ‘r ’ [0-9A-Z] + ' e ' { [+ -]} [0 -9 ] + Escape characters are ignored in floating point-style literals and so may be used to group digits. Certain lexical contexts restrict the form of floats allowed. This distinguishes cases such as s in 1.2 vs m. 1.2. A floating point literal may not 1. begin with unless the preceding token is a keyword other than " I ) ", "]" or "}"; 2. contain if the preceding token is 3. end with if the following character is • comments: The two characters “—” and all characters up to the end of the line. Under­ scores are not treated as escape characters in comments. 95 • documentation-. The two characters “++” and all characters up to the end of the line. Under­ scores are not treated as escape characters in documentation. • leading white space: a sequence of blanks or tabs at the beginning of a line. • embedded white space: a sequence of blanks or tabs not at the beginning of a line. • newline: a new line character. • layout markers: SETTAB BACKSET BACKTAB These do not appear in a source program but may be used to represent a linearized from of the token sequence. Comments and embedded white space are always ignored, except as used to separate tokens. For example, “abc” is taken as one token but “a b c” is taken as three. A .2 D ifferences B etw een th e Im plem ented Lexi­ cal Scanner and th e A ldor Lexical Structure There are some differences between my version of the tokens and the tokens from the A l d o r official web site. One of the major differences is th a t my external lexical analyzer does not handle an (underscore) character. In the ALDOR official tokens fist, acts as an escape symbol. In other words, the character can be ignored. For example, an A l d o r program reads the following as exactly the same token string: “apple”, “_apple”, “a_pp_l_e”, etc.. However, in my program, an symbol does not get implemented, so, it is not handled properly. The major reason I did not implemented it is due to the difficulty of the word counter. This symbol will mess up my work count variable, and cause the buffer not colour properly. For the reason, I had not implement the identification in a string. 96 Finally, the last difference is the layout markers. Although the type of token does exist in ÀLDOR official tokens, it does not exist in my program. These tokens are created by complicated rules th a t track levels of indentation th a t I chose not to implemented. However, they may be included in the future. 97 A ppendix B C ontext Free Grammars for Aldor /* RESERVE WORDS * / “/.token ADD ALWAYS AND ASSERT BREAK BUT CATCH DEFAULT DEFINE DELAY DO ELSE EXCEPT EXPORT EXTEND FIX FLUID FOR FREE FROM GENERATE GOTO HAS IF IMPORT IN INLINE IS ISNT ITERATE LET LOCAL MACRO NEVER NOT OF OR PRETEND REF REPEAT RETURN RULE SELECT THEN THROW TO TRY WHERE WHILE WITH YIELD BY CASE FINALLY MOD QUO REM /* SYMBOLS - DEFINEABLE * / “/.token SETMINUS POUND TIMES PLUS MINUS DIVIDE LESSTHAN EQUALS GREATERTHAN ATSIGN TILDE CARET VERY_LESS VERY.GREATER LESSEQ GREATEREQ GETS TO_SYMBOL JOIN MEET STARSTAR DOTDOT UNEQUALS CARETEQUALS /* SYMBOLS - RESERVED OPS * / “/.token DOLLAR SQUOTE LPAREN RPAREN FUN_ARROW COMMA COLON IS_ASSIGNED SEMI MACRO_IS EXIT LBRACK RBRACK LBRACE RBRACE VERY_EQUAL BAR /* SYMBOLS - FUTURE OPS * / “/.token LLBRACK LLBRACE LLPAREN RRBRACK RRBRACE RRPAREN GRAVE AMPERSAND OROR /* OTHERS */ “/.token INT ID FLOAT PREDOC POSTDOC STRING COLON_STAR DCOLON DOT PLUSMINUS BACKSET BACKTAB SETTAB s ta r t: curlyC ontents_Labelled 98 expression: I I la b e lle d la b e lle d expression SEMI la b e lle d exp ression SEMI : comma I d ecla ra tio n I ATSIGN atom la b e lle d I ATSIGN atom d e cla ra tio n : MACRO s ig I EXTEND s ig I LOCAL s ig I FLUID s ig I DEFAULT s ig I DEFINE s ig I FIX s ig I INLINE sigOp fromPartOpt I IMPORT sigOp fromPartOpt I EXPORT SigOp I EXPORT sigOp toPart I EXPORT sigOp fromPart to P a r t: TO in fix e d fromPartOpt: fromPart /* empty * / fromPart: FROM infixedCL infixedCL: in fix e d infixedCL COMMA in fix e d SigO p : s ig /* empty */ s ig : 99 declB inding I block d e c lP a r t: COLON type I COLON_STAR type comma: commaltem I commaltem COMMA comma commaltem: binding_AS I binding_AS WHERE commaltem declBinding: bindingR_IED_AS in fixed E xp rsD ecl: infixedE xprs I infixedE xprs declPart in fixed E xp rs: infixedExpr I infixedExpr COMMA infixedE xprs dgc_Binding_Symbol: IS_ASSIGNED I VERY_EQUAL I MACRO_IS I FUN_ARROW bindingR_IED_AS: in f ixedExprsDecl I infixedE xprsD ecl dgc_Binding_Symbol binding_AS binding_AS: bindingL_Infixed_AS binding_BS: bindingL_Infixed_BS 100 b in d in g_C ollection : bin din gL _Infixed _C ollection bindingL_Infixed_AS: anyStatement I in fix e d dgc_Binding_Symbol binding_AS bindingL _Infixed_C ollection: c o lle c t io n I in fix e d dgc_Binding_Symbol bin d in g_C ollection bindingL_Infixed_BS: balStatem ent I in fix e d dgc_Binding_Symbol binding_BS anyStatement: IF commaltem THEN binding_AS flow _as balS tatem ent: flow_bs flo w _ a s: c o lle c t io n IF commaltem THEN binding_BS ELSE binding_AS c o lle c t io n EXIT binding_AS ite r a to r s REPEAT binding_AS TRY binding_AS CATCH casesOpt finalPart_AS TRY binding_AS CATCH casesOpt alwaysPart_AS SELECT binding_AS IN cases DO binding_AS DELAY binding_AS GENERATE genBound binding_AS ASSERT binding_AS ITERATE ITERATE name BREAK BREAK name RETURN RETURN c o lle c t io n YIELD binding_AS 101 I EXCEPT binding_AS I GOTO id I NEVER flow_bs : c o lle c t io n I IF commaltem THEN binding_BS ELSE binding_BS I c o lle c t io n EXIT bindingJBS I ite r a to r s REPEAT binding_BS I TRY binding_AS CATCH casesOpt finalPartJBS I TRY binding_AS CATCH casesOpt alwaysPart_BS I SELECT binding_AS IN cases I DO binding_BS I DELAY binding_BS I GENERATE genBound binding_BS I ASSERT binding.BS I ITERATE I ITERATE name I BREAK I BREAK name I RETURN I RETURN c o lle c t io n I YIELD binding_BS I EXCEPT binding_BS I GOTO id I NEVER genBound: TO commaltem OF /* empty * / cases : b in d in g_C ollection casesO pt: b in d in g_C ollection I /* empty * / alwaysPart_AS: ALWAYS binding_AS alwaysPartJBS: 102 ALWAYS binding_BS fin alP artJB S : FINALLY bindingJBS I /* empty * / finalPart_AS: FINALLY binding_AS I /* empty * / c o lle c tio n ; in fix e d in fix e d ite r a to r s ite r a to r s : it e r a t o r s l it e r a t o r s l : ite r a to r it e r a t o r s l ite r a to r ite r a to r : FOR forLHS IN in fix e d FOR forLHS IN in fix e d suchThatPart WHILE in fix e d forLHS: in fix e d I FREE in fix e d I LOCAL in fix e d I FLUID in fix e d suchThatPart: BAR in fix e d in fix ed : infixedExpr I infixedExpr declPart I block infixedE xpr: 103 ell_OP e3 e3: e4 I e3 AMD e4 I e3 OR g 4 I e3 la tticeO p e4 e4: e5 I e4 HAS e5 I e4 relationO p e5 I relationO p e5 e5 : e6 e5 seqOp e5 seqOp e6 e6 : el e6 plusOp e l plusOp e l el: e8 e l quotientOp e8 e8: e9 e8 timesOp e9 e9: e ll_ e l2 e ll_ e l2 powerOp e9 e llO P : op ell_OP DCOLON e l2 ell_OP ATSIGN e l2 ell_OP PRETEND e l2 104 e ll_ e l2 : e l2 e ll_ e l2 DCOLON e l2 e ll_ e l2 ATSIGN e l2 e l l e l2 PRETEND e l2 type: e l l _ e 12 el2: e l3 I e l3 arrowOp e l2 el3: e l4 I e l4 DOLLAR qualT ail q u a lT a il: leftJu xtap osed I leftJu xtap osed DOLLAR qualT ail opQualTail: m olecule I m olecule DOLLAR opQualTail el4: e l5 I e l4 EXCEPT e l5 I WITH declM olecule I e l4 WITH declM olecule I ADD declM olecule I e l4 ADD declM olecule el5 : a p p lic a tio n op: arrowGp latticeO p relationO p seqOp plusOp quotientOp 105 timesOp powerOp nakedOp: arrowTok la ttice T o k relationT ok seqTok plusTok quotientTok timesTok powerTok arrowOp: arrowTok arrowTok DOLLAR opQualTail latticeO p : la ttice T o k I la ttice T o k DOLLAR opQualTail relationO p: relationT ok I relationT ok DOLLAR opQualTail seqOp: seqTok I seqTok DOLLAR opQualTail plusOp: plusTok I plusTok DOLLAR opQualTail quotientOp: quotientTok I quotientTok DOLLAR opQualTail timesOp: timesTok I timesTok DOLLAR opQualTail powerOp: powerTok 106 powerTok DOLLAR opQualTail arrowTok: GETS TÜ_SYMBÜL la tticeT o k : JOIN MEET relationT ok: ISNT I IS I CASE I VERY_LESS I VERY_GREATER I LESSTHAN I LESSEQ I EQUALS I GREATEREQ I GREATERTHAN I UNEQUALS I CARETEQUALS seqTok: BY DOTDOT plusTok: PLUS I MINUS I PLUSMINUS quotientTok: MOD I QUO I REM timesTok: TIMES I DIVIDE I SETMINUS 107 powerTok: CARET I STARSTAR a p p lica tio n : rightJuxtaposed rightJuxtaposed : jRight_Mol leftJu xtap osed : jLeft_Mol jRight_Mol: jL eft.M ol I jLeft_M ol jRight_Atom I NOT jRight_Atom jRight.Atom: jLeft.Atom I jLeft.Atom jRight.Atom I NOT jRight.Atom jL eft.M ol : m olecule NOT blockEnclosure jL eft.M ol blockEnclosure jL eft.M ol DOT blockM olecule jLeft.Atom: atom I NOT blockEnclosure I jLeft.Atom blockEnclosure I jLeft.Atom DOT blockM olecule m olecule atom enclosure enclosure parened I bracketed I quotedlds 108 d eclM olecule: a p p lic a tio n I block I /* empty * / blockM olecule: atom I enclosure I block blockE nclosure: enclosure I block block: p iled .E x p ressio n I curly_L abelled parened: LPAREN RPAREN I LPAREN exp ression RPAREN bracketed: LBRACK RBRACK I LBRACK exp ression RBRACK q u oted ld s: SqUOTE SqUOTE I SqUGTE names SqUGTE names : name I name CGMMA names atom: id lite r a l id: ID I PGUND I TILDE 109 name : ID I nakedOp lite r a l: I I INT FLOAT STRING doc_Expression: preDocument exp ression postDocument preDocument: preDocumentList preDocumentList: PREDOC preDocumentList I /* empty * / postDocument: postDocumentList postDocumentList: POSTDOC postDocumentList I /* empty * / p iled _E xp ression : SETTAB pileC ontents_E xpression BACKTAB p ileC on ten ts_E xp ression : doc_Expression I pileC ontents_E xpression BACKSET doc_Expression curly_L abelled: LBRACE curlyC ontents_L abelled RBRACE curlyC ontents_L abelled: curlyC ontentsL ist_L abelled 110 curlyC onteiitsL ist_L abelled: cu rlyC on ten tl.L ab elled I curlyC ontentl_L abelled curlyContentB_Labelled %prec SEMI curlyC ontentl_L abelled: curlyC ontentl_L abelled curlyContentA_Labelled I /* empty * / curlyContentA_Labelled: curlyContentB_Labelled SEMI postDocument curlyContentB_Labelled: preDocument la b e lle d postDocument 111 A ppendix C Abstract Synteix for Aldor d e c l_ ty p e : MACRO I FLUID I EXTEND I DEFAULT I LOCAL I INLINE I FREE I IMPORT I EXPORT binding_sym bol: IS.ASSIGNED I VERY_EQUAL I MACRO_IS I FUN_ARROW naked_operator : GETS 1 MEET 1 PLUS 1 PLUSMINUS 1 QUO 1 GREATERTHAN 1 UNEQUALS TO_SYMBOL 1 CARETEQUALS BY 1 ISNT CASE 1 MINUS VERY_GREATER LESSEQ 1 REM DIVIDE 1 GREATEREQ STARSTAR JOIN DOTDOT VERY_LESS MOD EQUALS SETMINUS CARET IS LESSTHAN TIMES lit e r a l : INTEGER.LITERAL I FLOATING_POINT_LITERAL I STRING LITERAL name : IDENTIFIER | naked_operator names : name I names COMMA name atom : 112 literal I IDENTIFIER m olecule atom 1 LPAREN RPAREN 1 LPAREN expression RPAREN 1 LBRACK RBRACK 1 LBRACK exp ression RBRACK 1 SQUOTE SQUOTE 1 SQUOTE names SQUOTE 1 block block: LBRACE expression RBRACE p re_d oc_Iist : I PREDOC p re_d oc_Iist p o st_ d o c_ Iist : I POSTDOC p o st_ d o c_ Iist expression expression COMMA expression I expression SEMI expression I expression SEMI I p re_d oc_Iist exp ression p o st_ d o c_ Iist I atom ATSIGN expression I atom ATSIGN I d ecla ra tio n I comma item comma_item binding binding WHERE comma_item binding : statem ent in fix e d binding_symboI statem ent sig_bIock : signatu re I block 113 d e cla ra tio n decl_type decl_type decl_type decl_type sig_b lock sig_b lock FROM in fix e d sig_b lock TO in fix e d sig_b lock FROM in fix e d TO in fix e d sign atu re : in fix e d in fix e d binding_symbol binding c o lle c t io n : in fix e d in fix e d i t e r a t o r _ lis t case : c o lle c t io n in fix e d binding_symbol c o lle c t io n statem ent IF comma_item THEN binding IF comma_item THEN binding ELSE binding c o lle c tio n c o lle c t io n EXIT binding i t e r a t o r _ lis t REPEAT binding TRY binding CATCH case TRY binding FINALLY binding TRY binding CATCH case FINALLY binding SELECT binding OF case ASSERT binding ITERATE ITERATE IDENTIFIER BREAK BREAK IDENTIFIER RETURN RETURN c o lle c tio n YIELD binding EXCEPT binding GOTO IDENTIFIER GENERATE binding GENERATE TO comma_item OF binding NEVER i t e r a t o r _ lis t : 114 ite r a to r ite r a to r l i s t COMMA ite r a to r forLHS_KW I FREE I FLUID I LOCAL ite r a to r FOR forLHS_KW in fix e d IN in fix e d FOR forLHS_KW in fix e d IN in fix e d "I" in fix e d WHILE in fix e d in fix e d : infixed_expr in fix e d COMMA infixed_expr infixed_expr COLON infixed_expr infixed_expr COLON_STAR infixed_expr block infixed_expr ADD in fix e d .e x p r NOT infixed_expr WITH infixed_expr operator infixed_expr infixed_expr operator infixed_expr infixed_expr infixed_expr ADD infixed_expr infixed_expr AND infixed_expr infixed_expr ARROW infixed_expr infixed_expr AT infixed_expr infixed_expr DCOLON infixed_expr infixed_expr DOLLAR infixed_expr infixed_expr DOT infixed_expr infixed_expr EXCEPT infixed_expr infixed_expr HAS infixed_expr infixed_expr OR infixed_expr infixed_expr PRETEND infixed_expr infixed_expr WITH infixed_expr infixed_expr operator infixed_expr m olecule operator operator naked_operator 115 naked_operator DOLLAR opQualTail opQualTail ; m olecule I m olecule DOLLAR opQualTail 116 A ppendix D Codes for my Work Instead of insert the source codes in this section, it will make more sense to put them on my web site. Hence, others will be able to download, modify, and execute the codes. Additionally, it will require around one hundred pages to print the codes. All of my codes will be upload to my UNBC web site at the following address: h ttp : //w e b .u n b c. c a /~ h sie h y 117 A ppendix E UML Diagram of External Lexical Analyzer 118 ProcewObkct i-Fj Top Package: A dofl cR' g H aldor-acan-Ws-mglon 0*WbdAwx%»M* emee-boffer f I «dwkxy»» o’ CO slart-ffocess « c m a te» > meg a ratum tailed g #c#nn*rou%iui (te*lore-polnl-t#ler) s. p [Em* loken (slmt-point. endpoint, token type) f a g > p t gel-1-vatue ektof-cotooT-eyntmx proceee-done «deeboy»» «deekxy» return W e d Colour Texts (successtbl) more lists In process object OR scanner process is not done 1eldor-f*e(Wrom-H$i) scannef-done :« ek Ipnooeee-eendneg A ppendix F UML Diagram of Parser 120 Figure F .l: UML Diagram for the Parser 121