Introduction

Hello wordfile creator!

Here is an add-on macro for the macro set that sorts syntax highlighting word lists and tests them for duplicate words. While the macro SortLanguage runs, duplicate words within a color group are removed automatically. The macro TestForDuplicate finds and reports duplicate words in different color groups, which the user can then remove. But neither macro finds or reports invalid word definitions.

The macro TestForInvalid is designed to test a language definition for invalid words in all color groups and to create a report. It never modifies the wordfile containing the language definition. It only writes a report of the invalid words it may have found into a new temporary file which is never saved. So this macro does not change anything on your hard disk.

The macro TestForInvalid can be run before or after the macros SortLanguage and TestForDuplicate. But I suggest running TestForInvalid after the other macros, because invalid words which are also duplicates have then already been removed and so fewer invalid words are reported.

Invalid words are not really bad because they do not harm the syntax highlighting in general. The only problem with an invalid word is that it may not be highlighted with the expected color. Invalid words may also slow down the syntax highlighting engine of UE/UES a little because of the larger number of words.

For wordfiles containing only a single syntax highlighting language see also command line tool TestInvalid and the Windows GUI UE Companion Utility.


What is an invalid word?

To answer that question first it must be explained and understood how the syntax highlighting engine of UltraEdit and UEStudio works.

A text file is nothing more than a more or less long sequence of characters. Some rules are necessary to interpret this sequence of characters and convert it into something which can be understood by you (your brain) or by a program like UE/UES or a compiler. For programming languages there are 4 main rules which all work on the same principle: lines, comments, strings and delimiters.

Lines

When a text file is read by a program the first rule used is: scan for the line termination character(s) to split the big sequence of characters into smaller parts called "lines". But this is already more complicated than it should be because there are 3 standards:

DOS/Windows: carriage return followed by line-feed
UNIX: line-feed only
Mac (classic Mac OS): carriage return only

The character carriage return has the byte code 13 decimal or 0D hexadecimal and is often specified in strings as \r or ^r.
The character line-feed has the byte code 10 decimal or 0A hexadecimal and is often specified in strings as \n or ^n.

So a general text editing program like UltraEdit must already be able to handle 3 different text file formats. But splitting a file into lines becomes even more complicated if the file contains more than 1 of the 3 formats above.

This is often caused by a programming error. A typical example on Windows operating systems is opening a file for writing in text mode, but printing to the file with "\r\n". In text mode every \r\n is automatically converted into \n when the sequence of characters is read from a text file, and every \n is automatically written as \r\n when writing to the file. So if the program code uses \r\n when writing to a file opened in text mode, the program writes \r\r\n, a line ending with 2 carriage returns before the line-feed instead of only 1, and the line ending detection problems start.

Line terminations are also often mixed in HTML files produced by PHP, Perl and other scripts because the script developer handled them incorrectly. And using the wrong FTP transfer mode when transferring text files between UNIX servers and Windows computers is another source of text files whose line terminations follow none of the 3 standards above.
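A file with mixed line terminations can be detected with a short script. The following is a minimal Python sketch and not part of the macro set; the function name count_line_endings and the category names are made up for this illustration.

```python
import re
from collections import Counter

# Longest alternative first, so \r\r\n is reported as one nonstandard
# terminator instead of a lone \r followed by \r\n.
_TERMINATOR = re.compile(rb"\r+\n|\r|\n")

# Hypothetical category names chosen for this sketch.
_NAMES = {b"\r\n": "dos", b"\n": "unix", b"\r": "mac"}


def count_line_endings(data: bytes) -> Counter:
    """Count every kind of line termination found in the raw bytes of a file."""
    counts = Counter()
    for match in _TERMINATOR.finditer(data):
        counts[_NAMES.get(match.group(), "nonstandard")] += 1
    return counts


sample = b"one\r\ntwo\nthree\rfour\r\r\nfive"
print(dict(count_line_endings(sample)))
```

When testing a real file, read it in binary mode (open(name, "rb")) so that no line ending translation hides the problem. More than one non-zero count, or any "nonstandard" entry, means the file mixes formats.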

Comments

A comment is a sequence of characters which the program should ignore when interpreting the content of a file. But how can such a comment be identified in the big sequence of characters of a file, even if the file has already been split up into lines?

Most comments are specified by special character sequences like

/*   for Block Comment On and
*/   for Block Comment Off or
//   for Line Comment On

for example in C/C++. The rule is quite simple and best explained with an example.

If /* is found in the sequence of characters in a file, a block comment starts and it ends if */ is found. For line comments the same rule is used. The only difference is that the Comment Off character sequence is predefined with the line termination character(s).

This simple rule for comments can be extended further. Some interpreters, for example, support nested block comments: several Block Comment On/Off sequences can occur inside a block comment. The program reading the file counts them to find the Block Comment Off sequence which belongs to the first Block Comment On sequence, instead of ending the block comment on the first occurrence of the Block Comment Off sequence.
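The comment rules described above can be sketched in a few lines of Python. This is only an illustration of the principle, not the way UE/UES implements it; the function names and the nested parameter are assumptions made for this example.

```python
def block_comment_end(text, start, on="/*", off="*/", nested=False):
    """Return the index just past the Block Comment Off sequence closing the
    block comment that starts at text[start], or len(text) if it never closes."""
    if not text.startswith(on, start):
        raise ValueError("no Block Comment On sequence at start")
    depth = 1
    i = start + len(on)
    while i < len(text):
        if text.startswith(off, i):
            depth -= 1
            i += len(off)
            if depth == 0:
                return i
        elif nested and text.startswith(on, i):
            depth += 1          # count inner comments only in nesting mode
            i += len(on)
        else:
            i += 1
    return len(text)


def line_comment_end(text, start):
    """A line comment ends at the line termination, which is predefined."""
    for i in range(start, len(text)):
        if text[i] in "\r\n":
            return i
    return len(text)


text = "a /* x /* y */ z */ b"
print(text[2:block_comment_end(text, 2)])               # simple rule
print(text[2:block_comment_end(text, 2, nested=True)])  # nesting rule
```

With the simple rule the comment ends at the first */, with the nesting rule it ends at the */ which belongs to the first /*.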

For line comments there are sometimes several additional rules as well. Most of these extended rules exist for languages where the Line Comment On is a single character instead of a character sequence. The developer of such a language perhaps thought it would be easier to use special rules instead of simply adding a second character to the Line Comment On definition to avoid misinterpretation when reading the character sequence of the file. I personally often cannot follow these thoughts, and I'm a programmer.

Strings

After splitting the character sequence of a file into lines and filtering out the parts which are comments and so are ignored for most other evaluations, the next step is often to find strings. A string is a sequence of characters which has a special meaning for various reasons, so the characters of a string should always be kept together and care must be taken when modifying this sequence. But how can such string character sequences be identified in the remaining character sequences of a file?

Wherever strings are possible there is always at least 1 special character which identifies the start and end of a string character sequence. The double quote character is often used. When this character is found in the sequence of characters in a file, a string starts, and it ends when the same character is found again.

This simple rule can be extended by several other rules. One is a second string identifying character like the single quote character. Another is an escape character (for example the backslash): inside a string, the character following the escape character never ends the string sequence. Some languages also have the rule that a string sequence must end before the line termination character(s). For those languages DisableMLS (disable multi-line string) should be used. For other languages like HTML or C/C++ multi-line strings are possible (often with an extra rule) and for those languages EnableMLS can be used.
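The string rules can be sketched in Python as follows. The function string_end and its multiline parameter are hypothetical; the parameter only mimics the idea behind EnableMLS and DisableMLS and says nothing about how the highlighting engine really works.

```python
def string_end(text, start, escape="\\", multiline=False):
    """Return the index just past the string that starts at text[start].

    text[start] must be the string identifying character (for example a
    double or single quote). If multiline is False the string is cut off
    at the first line termination.
    """
    quote = text[start]
    i = start + 1
    while i < len(text):
        ch = text[i]
        if escape and ch == escape:
            i += 2              # the character after the escape never ends the string
            continue
        if ch == quote:
            return i + 1
        if ch in "\r\n" and not multiline:
            return i            # single-line strings stop at the line termination
        i += 1
    return len(text)


code = r'say "a \"b\" c" end'
start = code.index('"')
print(code[start:string_end(code, start)])
```

The escaped quotes inside the string do not end it; only the last double quote does.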

Delimiters

After applying the rules for lines, comments and strings there are still plenty of character sequences left which must be divided further into many smaller parts which humans call "words". This is done with the same method as above. A set of characters has to be defined which delimits those character sequences into words. Everything between 2 delimiters is a word. Do you understand what the previous sentence means?

The delimiter characters define what a word is and not the characters of a word!

For example, look at the word highlighting. I'm sure you read this as 1 word. But why, when it also contains the words high and light which you know as words too? You interpret it as 1 word because of the delimiter space on the left side and the delimiter period on the right side. So never forget, the delimiters define what a word is. Without the delimiters the character sequences of a text file cannot be read and interpreted by you or a program. Look at the 'C' code example below:

printf("Found %u error%s!\n",errorcount,/* no 's' by exactly 1 error */errorcount==1?"":"s");

This is a valid code line for a 'C' compiler, and 'C' programmers can read it easily with syntax highlighting. Without syntax highlighting it would be much more difficult to read, even for 'C' programmers, because our brain is trained to use only the small set of delimiters needed for reading text. The code example above looks nothing like normal text.
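That the delimiters define the words can be tried out with a small Python sketch. The function split_words is made up for this illustration and ignores comments and strings; it only shows how the same text falls apart into different words when the set of delimiters changes.

```python
def split_words(line, delimiters):
    """Split a line into words; everything between 2 delimiters is a word."""
    words = []
    current = ""
    for ch in line:
        if ch in delimiters:
            if current:
                words.append(current)
                current = ""
            if not ch.isspace():
                words.append(ch)    # a visible delimiter is itself a 1 character word
        else:
            current += ch
    if current:
        words.append(current)
    return words


print(split_words("multi-line strings", " "))    # hyphen is a normal character
print(split_words("multi-line strings", " -"))   # hyphen is a delimiter
```

With only the space as delimiter, multi-line is 1 word. As soon as the hyphen is a delimiter too, the same characters are read as the words multi, - and line.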


Back to the question of this section: What is an invalid word?

Now the answer should be simple: an invalid word is every character sequence in the color groups which contains a delimiter character in combination with other delimiter characters or normal characters. The delimiters specify what a word is, and of course every delimiter itself is also a single character word. So it is simply not possible that a delimiter character is at the start (see exception below), in the middle or at the end of a word. And 2 or more delimiters cannot be combined into a word.


Which characters should be specified as delimiters for syntax highlighting?

Which characters should be specified as delimiters for syntax highlighting depends on the rules of the program used to read the text file you write and edit. As a general rule the space character must be specified as delimiter character. This is needed because the space character is the main delimiter character for the wordfile itself which is also a text file. And that answers the following question which is often asked by users not understanding how the syntax highlighting engine works:

Is it possible to define a word with a space?

No, that is not possible because the space is a delimiter for wordfiles and the delimiter characters specify what a word is. So it is not possible to define a character sequence with a space character to be interpreted as a word.

The tab character is often also a delimiter and is often forgotten because, like the space, it is not a visible character. Don't forget to specify the tab character in the set of delimiters. How the tab character is interpreted and displayed can vary. It depends on the tab stop value(s) for the current file. Or, as in HTML, it is always displayed as a single space (except in a preformatted text area). Be careful when copying a wordfile definition from a browser window into a text file. Make sure you have a real tab character in the set of delimiter characters after pasting the text into the text file.

The line termination characters carriage return and line-feed are always delimiter characters for text files and cannot be specified additionally as delimiter characters.

The characters which specify block comments, line comments and strings should also always be defined as delimiter characters. This is not absolutely necessary because the text file is interpreted in the order described above, but it should be done.

Operators and braces of any kind are also delimiter characters for programming languages. A color group with operators often contains invalid words because of combinations of delimiter characters. For example == or != are invalid words if the equal sign and the exclamation mark are delimiters. Such operator specification mistakes in the word list often go undetected because = and ! are also specified as single characters in the same color group. So nobody can see that, for example, != is highlighted with the color of ! and the color of =, and that the combined character sequence != is simply useless in the list of words. Remember, the delimiter characters specify what a word is.
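Why a word like != is useless can be shown with a tiny Python model of a word list. The delimiter set, the word list and the color group names below are invented for this example and are far smaller than in a real wordfile.

```python
import re

# Hypothetical, very small language definition for this sketch.
DELIMITERS = "!= \t()"
WORD_COLORS = {"if": "keywords", "!": "operators", "=": "operators", "!=": "operators"}


def tokens(text, delimiters=DELIMITERS):
    """Split text at the delimiters; visible delimiters are 1 character words."""
    cls = re.escape(delimiters)
    found = re.findall(f"[^{cls}]+|[{cls}]", text)
    return [t for t in found if not t.isspace()]


def unused_words(text):
    """Return the words of the word list which can never match in text."""
    return sorted(set(WORD_COLORS) - set(tokens(text)))


print(tokens("if (a!=b)"))
print(unused_words("if (a!=b)"))
```

The text is split into ! and = before any word is looked up, so the entry != is never reached, although the highlighting looks right because both single characters belong to the same color group.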

Something special is the usage of marker characters with a definition like:

/Marker Characters = "[]%%"

Marker characters are a variant of strings. Every pair of marker characters specifies a sequence of characters to be highlighted with one color. In contrast to strings, the start and end characters of such a highlighted (marked) character sequence need not be identical, as [] above shows. But since UE v9.20 marker characters can also have the same start and end character, as %% above shows. A character sequence highlighted with marker characters cannot span a line termination (like single-line strings).

The marker characters should also be specified as delimiter characters.
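How pairs of marker characters select character sequences can be sketched in Python. The functions marker_pairs and marked_spans are hypothetical, and the sketch assumes that a start character without a matching end character on the same line marks nothing; how UE/UES really treats that case is not described here.

```python
def marker_pairs(definition):
    """Split a marker definition such as '[]%%' into (start, end) pairs."""
    return [(definition[i], definition[i + 1])
            for i in range(0, len(definition) - 1, 2)]


def marked_spans(line, pairs):
    """Return the marked character sequences of a single line.

    The function works on 1 line only, so a marked sequence can never
    span a line termination.
    """
    end_for = dict(pairs)
    spans = []
    i = 0
    while i < len(line):
        end_char = end_for.get(line[i])
        if end_char is not None:
            j = line.find(end_char, i + 1)
            if j != -1:
                spans.append(line[i:j + 1])
                i = j + 1
                continue
        i += 1
    return spans


pairs = marker_pairs("[]%%")
print(marked_spans("a[1] = %PATH% + x", pairs))
```

The pair [] uses different start and end characters while the pair %% uses the same character for both.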

And often most other special characters in the ASCII table are used as delimiters too. You can also use ANSI characters as delimiters, but not Unicode characters because the wordfile must be an ASCII/ANSI file. Here is a very often used delimiters definition:

/Delimiters = ~!@%^&*()-+=|\/{}[]:;"'<> ,tab.?

Note: tab in the line above stands for a real tab character.
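Whether a pasted delimiters line really contains a tab character can be checked with a few lines of Python. This is only a sketch: it assumes that everything after "/Delimiters =" up to the end of the line is a delimiter character, and the function names are made up for this illustration.

```python
PREFIX = "/Delimiters ="


def parse_delimiters(line):
    """Return the set of delimiter characters of a /Delimiters line.

    Assumption of this sketch: all characters after '/Delimiters =' up to
    the end of the line are taken as delimiter characters.
    """
    if not line.startswith(PREFIX):
        raise ValueError("not a /Delimiters line")
    return set(line[len(PREFIX):].rstrip("\r\n"))


def check_delimiters(delimiters):
    """Report typical copy and paste problems in a set of delimiters."""
    problems = []
    if "\t" not in delimiters:
        problems.append("tab character missing")
    letters = sorted(ch for ch in delimiters if ch.isalnum())
    if letters:
        problems.append("letters or digits used as delimiters: " + "".join(letters))
    return problems


good = '/Delimiters = ~!@%^&*()-+=|\\/{}[]:;"\'<> ,\t.?'
bad = '/Delimiters = ~!@%^&*()-+=|\\/{}[]:;"\'<> ,tab.?'
print(check_delimiters(parse_delimiters(good)))
print(check_delimiters(parse_delimiters(bad)))
```

The second line contains the 3 letters t, a and b instead of a real tab character, which is exactly what happens when the definition is typed as displayed above.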

The delimiter characters are always case-sensitive regardless of the keyword Nocase in the language definition line. But this matters only for letters, which are normally not used as delimiters.


Usage of macro TestForInvalid

Using the macro TestForInvalid is as simple as using the macro SortLanguage or TestForDuplicate. Set the caret anywhere within the language definition you want to test for invalid words and start the macro TestForInvalid. That's all, sit back and watch what happens.

If the macro finds no invalid words the report contains only the following line:

Congratulations! No invalid words found.

If the macro finds invalid words the report looks like the report below for language PHP in standard wordfile.txt of UltraEdit v13.00a:

Sorry! Found following invalid words:

!=                                      <- contains the delimiter:  =
&&                                      <- contains the delimiter:  &
*=                                      <- contains the delimiter:  =
++                                      <- contains the delimiter:  +
+=                                      <- contains the delimiter:  =
--                                      <- contains the delimiter:  -
-=                                      <- contains the delimiter:  =
.=                                      <- contains the delimiter:  =
/=                                      <- contains the delimiter:  =
<=                                      <- contains the delimiter:  =
==                                      <- contains the delimiter:  =
||                                      <- contains the delimiter:  |
class.com                               <- contains the delimiter:  .
class.dir                               <- contains the delimiter:  .
class.dotnet                            <- contains the delimiter:  .
class.variant                           <- contains the delimiter:  .

Now you have to look at this report and either remove the invalid words from the wordfile or modify the set of delimiter characters.

To identify invalid word definitions in the list of words correctly, the macro has to apply some special rules.

  1. The (visible) delimiters can also be specified in a color group as single character words. But a combination of delimiter characters is not valid, which is a frequent mistake in the color group for operators (see example report above).
  2. Since UE v10.00 a word definition is allowed to start with a delimiter character, like the HTML entities in the standard wordfile.txt which start with & although this character is also a delimiter character. The & must be a delimiter, like the ;, or the entities would not be highlighted correctly. But delimiters are not allowed anywhere else in a word except as first character. That's the reason why the semicolon of the HTML entities is specified separately, although that means all semicolons in the text of HTML files are then highlighted with the color of HTML entities, not only those at the end of an HTML entity.
  3. For language definitions which contain the case-sensitive keywords HTML_LANG or XML_LANG in the first line, words starting with < or </, and/or ending with > or /> or = are allowed, even when the 4 characters <, /, > and = are also delimiters.
  4. Every marker character pair must be specified like a word in a color group, although the marker characters can and should also be delimiter characters.
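The 4 rules above can be approximated in Python as follows. This sketch is not the code of the macro, which is written in the UltraEdit macro language in SyntaxTools.uem. All names, the order in which the rules are applied and the 40 column padding of the report line are assumptions made for this illustration.

```python
def offending_delimiter(word, delimiters, marker_pairs=(), html_lang=False):
    """Return the first delimiter which makes word invalid, or None if valid."""
    if len(word) == 1 or word in marker_pairs:
        return None                         # rules 1 and 4
    core = word
    stripped_prefix = False
    if html_lang:                           # rule 3: tag-like words
        for prefix in ("</", "<"):
            if core.startswith(prefix):
                core = core[len(prefix):]
                stripped_prefix = True
                break
        for suffix in ("/>", ">", "="):
            if core.endswith(suffix):
                core = core[:-len(suffix)]
                break
    if not stripped_prefix:
        core = core[1:]                     # rule 2: first character may be a delimiter
    for ch in core:
        if ch in delimiters:
            return ch
    return None


def report(words, delimiters, **options):
    """Build report lines similar to the sample report above."""
    lines = []
    for word in words:
        ch = offending_delimiter(word, delimiters, **options)
        if ch is not None:
            # The padding to 40 columns is a guess at the report layout.
            lines.append(f"{word:<40}<- contains the delimiter:  {ch}")
    return lines


DELIMS = set("~!@%^&*()-+=|\\/{}[]:;\"'<> ,\t.?")
for line in report(["!=", "=", "&nbsp", "class.com", "echo"], DELIMS):
    print(line)
```

Only != and class.com are reported in this example. The single character word = and the word &nbsp starting with a delimiter are accepted by rules 1 and 2.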

If you are interested in how the macro handles these 4 rules, read the comments for macro TestForInvalid in the macro code file SyntaxTools.uem.

You can assign a hotkey to macro TestForInvalid if it is used frequently.

The macro uses the UltraEdit style regular expression engine. If you prefer the UNIX or Perl compatible regular expression engine you have to insert the macro command UnixReOn or PerlReOn before every macro exit. Search for UnixReOn in the file SyntaxTools.uem to find the 4 exit positions. The macro source code is not included here.

To use this macro you need at least v8.20 of UltraEdit or UEStudio. The macro was developed and tested with UE v10.10c, v11.20a, v12.20b+1 and v13.00a+2.

If you find any bugs or have related questions, post them at http://www.ultraedit.com/forums/viewtopic.php?t=443.


Disclaimer

THIS MACRO IS PROVIDED "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, STATUTORY OR OTHERWISE, INCLUDING WITHOUT LIMITATION ANY IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO USE, RESULTS AND PERFORMANCE OF THE MACRO IS ASSUMED BY YOU AND IF THE MACRO SHOULD PROVE TO BE DEFECTIVE, YOU ASSUME THE ENTIRE COST OF ALL NECESSARY SERVICING, REPAIR OR OTHER REMEDIATION. UNDER NO CIRCUMSTANCES, CAN THE AUTHOR BE HELD RESPONSIBLE FOR ANY DAMAGE CAUSED IN ANY USUAL, SPECIAL, OR ACCIDENTAL WAY OR BY THE MACRO.