Bug fix for Syntaxer.pmod
It didn’t take long to notice that I had f–ked up the HTMLParser class a little bit. The way entities were handled didn’t really work as expected: entities in tag attributes were duplicated and inserted into the tag content. The good side of it is that I learned about the context() method of the Parser.HTML Pike class. I only want to match entities in the data section, i.e. tag content, and not in attributes, and the context() method tells you, as the name implies, in what context the entity is found. So my entity callback function now looks like:
- //! Entity callback
- protected void ecb(Parser.HTML p, string _data)
- {
- if (p->context() == "data")
- line += colorize(entify(_data), "entity");
- }
which hopefully will be completely bug free now. So one down, 854 to go.
Codify RXML tag and Syntaxer.pmod 17:31, Sat 17 October 2009 :: 183.9 kB
A Pike syntax highlighting module
So I thought I should try to port my syntax highlighting script, Syntaxer, from PHP to Pike. Mostly for the fun of it, but also to improve my knowledge of string handling in Pike. The greatest concern here is that PHP is a dynamic language and Pike is not (in the same sense), and the PHP version of Syntaxer depends heavily on dynamic loading of PHP files. The reason for this is that I generate the “syntax maps” dynamically from the syntax files of Edit+. That means that if you want support for a new language you just drop a .stx file in the right location and there you go. My script will convert that into a static PHP file, so that the conversion only needs to be done once, and load that file on the fly when that particular language is requested.
I thought that this method would be hard to implement in Pike, although it might be possible, so I had to come up with a slightly different approach. Frankly, it’s not that often you alter the .stx files or implement support for new languages, so my solution is to manually create the definitions for each language. I still use the .stx files from Edit+, although one needs to copy and paste a bit.
In the Pike solution each language is its own class that inherits the master class .Hilite. Pretty much the only thing you need to put in the derived class is a few definitions that specify what is what in the language. For example, the C++ definition looks like this:
- inherit .Hilite;
- public string title = "C++";
- //| Override the keywords mapping
- private mapping(string:multiset(string)) keywords = ([
- "keywords" : (<
- "auto","bool","break","case","catch","char","cerr","cin",
- "class","const","continue","cout","default","delete","do",
- "double","else","enum","explicit","extern","float","for",
- "friend","goto","if","inline","int","long","namespace","new",
- "operator","private","protected","public","register","return",
- "short","signed","sizeof","static","struct","switch","template",
- "this","throw","try","typedef","union","unsigned","virtual",
- "void","volatile","while","__asm","__fastcall","__based",
- "__cdecl","__pascal","__inline","__multiple_inheritance",
- "__single_inheritance" >),
- "compiler" : (<
- "define","error","include","elif","if","line","else","ifdef","pragma" >)
- ]);
- //| Override the default since # is no line comment in C++
- protected array(string) linecomments = ({ "//" });
- void create()
- {
- ::create();
- colors += ([ "compiler" : "#060" ]);
- styles += ([ "compiler" : ({ "<b>", "</b>" }) ]);
- }
And you really don’t need to make it more fancy than that. For most C-based languages the definitions in the master class .Hilite are enough. Just add the keywords to the keywords mapping and it looks better than nothing.
HTML parser
One thing that differs from the PHP version of Syntaxer is that SGML-based, or tag based, languages are run through an HTML parser. The downside of the PHP version is that tag content gets highlighted as well, which of course isn’t what we want. Since Pike has a decent HTML parser that behaves like a SAX parser, I wrote a class, HTMLParser, that uses that for highlighting tag based stuff. The HTMLParser class also inherits .Hilite, so the methods and members are the same.
I wonder why there’s no built-in HTML parser for PHP?
A Roxen tag module
Of course I had to write a Roxen tag module so that we can highlight source code in Roxen web pages. This was the reason for writing the Pike module in the first place. The tag is named codify, which might not be the most innovative name, but what the heck! The beauty of it is that I made it possible, in the module settings tab, to create a surrounding HTML template for the output. When you run some code through the parser you get the highlighted source code as well as the name of the language and how many lines of code were highlighted, and it might be nice to present that as well (just like the code blocks on this site). It’s tedious writing that surrounding HTML every time, so now you just put it in the settings and the code blocks will always look the same.
Finally
There’s some stuff left to do but the code works well enough to be usable. And I must say that the Pike version is like a thousand times faster than the PHP version!
Oh, and I have implemented support for the following languages:
- ActionScript
- C
- C++
- C#
- CSS
- Java
- JavaScript
- HTML
- Perl
- PHP
- Pike
- Python
- Ruby
- RXML
- XSL
And that’s that for now.
Syntaxer 2.0.2 released
A few bugs were fixed in this release. I also noted a few new ones, but those were mostly in the SyntaxMap class, which generates PHP arrays from the SyntaxMap files.
Changelog for this version
- Fixed a bug where the end of strings (quotes) wasn’t found correctly in languages that have no escape character (like XSL, XML and HTML).
- Also changed the Syntaxer::AutoDetect() method to return the extension of a file if no alias was found. This way we can pass a path to the method and use the result directly as the argument to `Syntaxer::__construct()`
- $file = '/some/path/to/file.xsl';
- $lang = Syntaxer::AutoDetect($file);
- $stx = new Syntaxer($lang);
- $stx->Parse(file_get_contents($file));
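The first fix in the changelog above, finding the end of a string in languages without an escape character, can be illustrated with a small sketch. This is not Syntaxer’s actual code: the function name find_string_end() and its parameters are made up for the illustration, and it assumes PHP 7.1 or later for the type hints.

```php
<?php
// Hypothetical helper, not part of Syntaxer: returns the position of the
// quote that closes the string opened at $start, or null if there is none.
// Pass null as $escape for languages without an escape character
// (XML, HTML, XSL and the like).
function find_string_end(string $code, int $start, ?string $escape = '\\'): ?int
{
    $quote = $code[$start];
    $len = strlen($code);
    for ($i = $start + 1; $i < $len; $i++) {
        if ($escape !== null && $code[$i] === $escape) {
            $i++; // skip the escaped character
            continue;
        }
        if ($code[$i] === $quote) {
            return $i;
        }
    }
    return null; // unterminated string
}
```

With the default backslash escape, a markup attribute such as "c:\" never seems to end, because the backslash swallows the closing quote. Passing null as the escape character makes the same input terminate at the quote as expected.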
You can read about and download Syntaxer over here.
Syntaxer 2.0.1 released
I’ve released a new version of my generic syntax highlighting script Syntaxer. I fixed a potential bug where code written on an operating system that only uses \r to define a newline would be messed up (thank you jOOOL at PHPPortalen). Now I’ve bullet proofed (I hope) the way newlines are handled: I replace all \r\n with \r and then replace all \r with \n, which means that we always end up with a single \n as the newline character.
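The two replacements described above can be sketched like this. The function name normalize_newlines() is made up for the example and is not taken from Syntaxer itself; the type hints assume PHP 7 or later.

```php
<?php
// Hypothetical helper illustrating the two-step newline normalization.
function normalize_newlines(string $code): string
{
    // Step 1: collapse every \r\n pair into a single \r.
    $code = str_replace("\r\n", "\r", $code);
    // Step 2: turn every \r (original ones and those from step 1) into \n.
    return str_replace("\r", "\n", $code);
}

var_dump(normalize_newlines("a\r\nb\rc\nd") === "a\nb\nc\nd"); // bool(true)
```

The order matters: doing the \r to \n replacement first would turn every \r\n into two newlines.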
New syntax highlighting script
A couple of years ago I wrote a generic syntax highlighting script. What I did was use the syntax files from Edit+ to determine how to parse a given language. All languages have different keywords, function names, delimiters and so on, and to know how to highlight a certain language you need to know these things. The Edit+ .stx files describe all these things.
Since I have gained a few more years of knowledge, and PHP5 has arrived, I thought I should write a new version of it. I could reuse some of the code, but a lot was rewritten and totally redesigned. The script has two classes:
- One class to parse the Edit+ syntax files, which get converted into PHP files so that the syntax files don’t have to be parsed for every request. If the given .stx file has a newer timestamp than the cached PHP file, the PHP file will be regenerated. A lot of this code could be reused from the older version.
- The actual highlighting class. This class was almost entirely rewritten. Here I loop through every character of the code to highlight. When a keyword, delimiter or something else detectable is matched I grab that and search forward to where the rule ends. In the older version I had a different approach where the code was split on newlines, so I looped through line by line, and for each line I looped through each character and did a similar match as in the new version.
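The timestamp check used for the cached PHP files could look roughly like the sketch below. The function name cache_is_stale() and both parameter names are assumptions made for the illustration, not the names used in the script; PHP 7 or later is assumed for the type hints.

```php
<?php
// Hypothetical helper: returns true when the cached PHP file has to be
// (re)generated from the Edit+ syntax file.
function cache_is_stale(string $stxFile, string $cacheFile): bool
{
    // No cached file yet: it has to be generated.
    if (!is_file($cacheFile)) {
        return true;
    }
    // Regenerate when the syntax file is newer than the cached file.
    return filemtime($stxFile) > filemtime($cacheFile);
}
```

A caller would run the conversion only when this returns true and otherwise just include the cached file.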
The new approach has some advantages:
- There’s no need to duplicate the code, which means a lot less memory is used.
- Fewer flags are needed, since when I match a detectable rule I at once search for the end of the rule. This means that fewer conditional statements are needed, which speeds things up a lot.
- And foremost, the code got much, much cleaner!
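To make the “search for the end of the rule at once” idea concrete, here is a heavily simplified sketch. It is not the real highlighting class: the function name scan(), the token format and the fixed set of rules (quoted strings and one line comment marker) are assumptions for the example, and it ignores keywords and escape characters entirely.

```php
<?php
// Hypothetical sketch: split $code into [type, text] pairs by jumping
// straight to the end of each matched rule instead of tracking state flags.
function scan(string $code, string $lineComment = '//'): array
{
    $tokens = [];
    $plain = '';
    $len = strlen($code);
    $i = 0;
    while ($i < $len) {
        $ch = $code[$i];
        if ($ch === '"' || $ch === "'") {
            // String rule: search forward for the matching quote.
            $end = strpos($code, $ch, $i + 1);
            $end = ($end === false) ? $len - 1 : $end;
            $type = 'string';
        } elseif (substr($code, $i, strlen($lineComment)) === $lineComment) {
            // Line comment rule: runs up to, but not including, the newline.
            $nl = strpos($code, "\n", $i);
            $end = ($nl === false) ? $len - 1 : $nl - 1;
            $type = 'comment';
        } else {
            // Nothing detectable here: collect the character as plain text.
            $plain .= $ch;
            $i++;
            continue;
        }
        if ($plain !== '') {
            $tokens[] = ['plain', $plain];
            $plain = '';
        }
        $tokens[] = [$type, substr($code, $i, $end - $i + 1)];
        $i = $end + 1; // continue right after the rule
    }
    if ($plain !== '') {
        $tokens[] = ['plain', $plain];
    }
    return $tokens;
}
```

Because the loop index jumps past the whole string or comment in one step, there is no “am I inside a string?” flag to check on every character, which is the saving the list above describes.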
Anyhow! There are a few minor bugs but the code is pretty usable (I have implemented it here in the blogging system). I added the scripts with documentation and a simple example on the server for anyone to download.
The Syntaxer2 can be found and downloaded over here. Some bug fixes and more examples will be done in the very near future.


