HowToParseWiki

I've been unable to find a good wiki parser online. By good I mean one that is clear, intuitive, and well separated into components. As well as aiding maintainability, that separation allows components to be easily extended or disabled. The difficulty is getting the components to interact nicely.
All wikis I have seen have a complex parsing model. Early in parsing, various markup is found and made opaque to the parser. The most obvious case is nowiki tags, though things like monospace shouldn't respond to certain markup either. (Should monospace be emphasisable? It certainly shouldn't be 'small'.) Also, things like URLs shouldn't be modified just because they happen to contain wiki markup. To avoid interaction between rules, rules must be able to cope with opaque blobs that they skip over. Every wiki parser I've seen has some hack or other to do this. This wiki, for example, temporarily uses 8-bit characters during parsing.
Most wikis parse from wiki-text to html. I would like to parse to an AST (abstract syntax tree), and then run methods on that AST to, for example, generate html, or plain text, or RSS, or whatever.
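As a rough illustration of why an AST is attractive, here is a minimal Python sketch in which one tree is rendered two ways. The node classes (Text, Emphasis, Link) and the functions to_html and to_plain are hypothetical names chosen for this example, and plain functions stand in for methods on the AST.

```python
from dataclasses import dataclass, field
from html import escape
from typing import List, Union


@dataclass
class Text:
    value: str


@dataclass
class Emphasis:
    children: List["Node"] = field(default_factory=list)


@dataclass
class Link:
    url: str
    children: List["Node"] = field(default_factory=list)


Node = Union[Text, Emphasis, Link]


def to_html(node: Node) -> str:
    """Render the tree as HTML, escaping text and attribute values."""
    if isinstance(node, Text):
        return escape(node.value)
    inner = "".join(to_html(c) for c in node.children)
    if isinstance(node, Emphasis):
        return f"<em>{inner}</em>"
    return f'<a href="{escape(node.url, quote=True)}">{inner}</a>'


def to_plain(node: Node) -> str:
    """Render the same tree as plain text, dropping emphasis."""
    if isinstance(node, Text):
        return node.value
    inner = "".join(to_plain(c) for c in node.children)
    if isinstance(node, Link):
        return f"{inner} <{node.url}>"
    return inner


# Hypothetical example tree: emphasised text containing a link.
tree = Emphasis([Text("see "), Link("http://example.org/", [Text("R&D")])])
print(to_html(tree))   # <em>see <a href="http://example.org/">R&amp;D</a></em>
print(to_plain(tree))  # see R&D <http://example.org/>
```

An RSS or other output would be one more function over the same tree, with no change to the parser.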
These last two requirements fit together to mean that parsing should take place on a sequence of nodes, some transparent and some opaque. Here transparent means that the node may contain markup for a rule to find. Most naturally, this sequence can be the children of a particular node in an AST. So a third type of node is an unresolved node: one where parsing can take place by looking for markup in its transparent children.
A parsing rule operates on an unresolved node, looking at its transparent children. It then creates a replacement for an unresolved node, incorporating newly-generated nodes of any type derived from transparent children, and any of its opaque subtrees. The replacement can be any of the three types.
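The three node types and a single rule might be sketched as follows. This is only an illustration under assumptions: the class names Transparent, Opaque and Unresolved, and the nowiki_rule function, are invented here, and a regular expression stands in for whatever matching a real rule would use.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class Transparent:
    """Raw text that rules may still search for markup."""
    text: str


@dataclass
class Opaque:
    """A finished subtree; rules must skip over it."""
    kind: str
    children: List[object] = field(default_factory=list)


@dataclass
class Unresolved:
    """A node whose transparent children still await parsing."""
    kind: str
    children: List[object] = field(default_factory=list)


NOWIKI = re.compile(r"<nowiki>(.*?)</nowiki>", re.DOTALL)


def nowiki_rule(node: Unresolved) -> Unresolved:
    """Build a replacement node in which <nowiki> spans have become opaque."""
    out: List[object] = []
    for child in node.children:
        if not isinstance(child, Transparent):
            out.append(child)  # existing subtrees are carried over untouched
            continue
        pos = 0
        for m in NOWIKI.finditer(child.text):
            if m.start() > pos:
                out.append(Transparent(child.text[pos:m.start()]))
            out.append(Opaque("nowiki", [m.group(1)]))
            pos = m.end()
        if pos < len(child.text):
            out.append(Transparent(child.text[pos:]))
    return Unresolved(node.kind, out)


# Hypothetical input: one transparent child under an unresolved page node.
page = Unresolved("page", [Transparent("a <nowiki>''b''</nowiki> c")])
result = nowiki_rule(page)
```

Here result.children is a transparent "a ", an opaque nowiki node holding ''b'', and a transparent " c". A later emphasis rule would only ever look inside the two transparent pieces.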
Only certain rules should be activated on certain unresolved nodes, according to their position in the tree. Therefore there are different types of unresolved node, and each rule acts only on a subset of those types (usually just one). The types of unresolved node should be hierarchical. They naturally correspond to an attribute of the transparent text underneath the unresolved node: a whole page, a paragraph block, rich text, and so on. For example, a rule to parse named external links should operate before the rich-text layer, removing the URL part and placing the text part in a rich-text block, beneath an opaque "link" node. Then outer rich markup (such as emphasis) will not affect the URL, or be modifiable within the link (preventing <em><a></em></a> non-nesting problems), but internal rich markup is still allowed within the link's text part.
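The named-link example can be sketched in Python as below. The bracket syntax [http://... text], the "wiki-block" and "rich-text" type names used as strings, and the Link class are all assumptions made for this illustration rather than a settled design.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class Transparent:
    text: str


@dataclass
class Unresolved:
    kind: str  # e.g. "wiki-block" or "rich-text"
    children: List[object] = field(default_factory=list)


@dataclass
class Link:
    """Opaque to outer rules; only its rich-text label is parsed further."""
    url: str
    label: Unresolved


# Assumed syntax for a named external link: [http://example.org/ link text]
NAMED_LINK = re.compile(r"\[(https?://[^\s\]]+)\s+([^\]]+)\]")


def named_link_rule(node: Unresolved) -> Unresolved:
    """Act only on wiki-block nodes, i.e. before the rich-text layer."""
    if node.kind != "wiki-block":
        return node
    out: List[object] = []
    for child in node.children:
        if not isinstance(child, Transparent):
            out.append(child)
            continue
        pos = 0
        for m in NAMED_LINK.finditer(child.text):
            if m.start() > pos:
                out.append(Transparent(child.text[pos:m.start()]))
            # The URL is stored as data; the label stays open to rich markup.
            label = Unresolved("rich-text", [Transparent(m.group(2))])
            out.append(Link(m.group(1), label))
            pos = m.end()
        if pos < len(child.text):
            out.append(Transparent(child.text[pos:]))
    return Unresolved(node.kind, out)


block = Unresolved(
    "wiki-block",
    [Transparent("see [http://a.org/x''y a ''fine'' page] now")],
)
linked = named_link_rule(block)
```

In linked, the '' inside the URL is no longer visible to any emphasis rule, while the '' pair in the label sits in its own rich-text node and can be parsed there without crossing the link boundary.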
Once all rules have been applied to an unresolved node, any remaining unresolved node is transformed into a new type, according to its original type.
Within the rules applied to a particular unresolved node, the leftmost, then longest, rule applies. This seems to be how wiki syntax tends to work. For example, <nowiki><foo></foo></nowiki> wouldn't give special meaning to <foo>. I think only the five-tick problem really breaks this, meaning that bold and italic need to be considered together, which I suppose is quite natural anyway.
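Leftmost-then-longest selection could look something like the following Python sketch. The rule table and its patterns are hypothetical, the output is a flat token list rather than AST nodes, and nothing here attempts to solve the five-tick problem.

```python
import re
from typing import List, Tuple

# Hypothetical rule table: (name, pattern). Group 1 is the enclosed text.
RULES = [
    ("italic", re.compile(r"''(.+?)''")),
    ("bold", re.compile(r"'''(.+?)'''")),
    ("nowiki", re.compile(r"<nowiki>(.*?)</nowiki>", re.DOTALL)),
    ("foo", re.compile(r"<foo>(.*?)</foo>", re.DOTALL)),
]


def tokenize(text: str) -> List[Tuple[str, str]]:
    """Repeatedly apply whichever rule matches leftmost, then longest."""
    out: List[Tuple[str, str]] = []
    pos = 0
    while pos < len(text):
        best = None
        for name, pattern in RULES:
            m = pattern.search(text, pos)
            if m is None:
                continue
            # Smaller start wins; on a tie, the longer match wins.
            key = (m.start(), -len(m.group(0)))
            if best is None or key < best[0]:
                best = (key, name, m)
        if best is None:
            out.append(("text", text[pos:]))
            break
        _, name, m = best
        if m.start() > pos:
            out.append(("text", text[pos:m.start()]))
        out.append((name, m.group(1)))
        pos = m.end()
    return out


print(tokenize("a <nowiki><foo>x</foo></nowiki> b"))
print(tokenize("'''b''' i"))
```

In the first call the nowiki match starts to the left of the <foo> match and swallows it, so <foo> never gets special meaning. In the second, italic and bold both match at position 0, and bold wins because its match is longer.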
There's also the type-2 language problem of <foo><foo></foo></foo>, which I need to think about, as I think its resolution will differ in different circumstances.
Initially the AST has a top-level unresolved node of type page, with a single transparent child.
MediaWiki uses HTML-style <foo></foo> syntax for plugins, which I think is a good idea. I think that should be the top level of parsing, from page to wiki-page, for text not within one of these. We then need a break on trailing markup, based around blank lines: wiki-page becomes raw-wiki-block. For normal text, rules for doing lists and so on transform a raw-wiki-block into a wiki-block. A wiki-block is similar in level to a CSS block display type. Then rules extract non-text markup (links etc.), transforming a wiki-block into rich-text. Finally, rich-text is marked up into plain text nodes.
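The chain of types above, together with the idea that whatever is left over after the rules is promoted to the next type, can be sketched roughly as follows. This is a simplification under assumptions: each node holds a single string rather than children, the names NEXT_TYPE, resolve and split_on_blank_lines are invented, and the only rule supplied is the blank-line split.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Node:
    kind: str
    text: str


# Assumed promotion table, following the chain described above.
NEXT_TYPE = {
    "page": "wiki-page",
    "wiki-page": "raw-wiki-block",
    "raw-wiki-block": "wiki-block",
    "wiki-block": "rich-text",
    "rich-text": "text",
}

Rule = Callable[[Node], List[Node]]


def split_on_blank_lines(node: Node) -> List[Node]:
    """wiki-page rule: one raw-wiki-block per blank-line-separated chunk."""
    chunks = re.split(r"\n\s*\n", node.text)
    return [Node("raw-wiki-block", c.strip()) for c in chunks if c.strip()]


def resolve(node: Node, rules: Dict[str, List[Rule]]) -> List[Node]:
    """Run the rules for this node's type, then promote whatever remains."""
    if node.kind not in NEXT_TYPE:
        return [node]  # already fully resolved
    pending = [node]
    for rule in rules.get(node.kind, []):
        produced: List[Node] = []
        for n in pending:
            # A rule only sees nodes that are still of the original type.
            produced.extend(rule(n) if n.kind == node.kind else [n])
        pending = produced
    result: List[Node] = []
    for n in pending:
        if n.kind == node.kind:
            n = Node(NEXT_TYPE[n.kind], n.text)  # leftover: promote by type
        result.extend(resolve(n, rules))
    return result


rules = {"wiki-page": [split_on_blank_lines]}
nodes = resolve(Node("page", "one\n\ntwo"), rules)
```

With only the split rule registered, nodes ends up as two plain text nodes, "one" and "two". List, link and emphasis rules would be registered under "raw-wiki-block", "wiki-block" and "rich-text" respectively, and a rule is assumed never to emit a type earlier in the chain than the one it consumes.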
A further problem is omitted containers.