Reconstructing Clarion's XML tooling, Part 1
If you want to do XML with Clarion you have a few options. A lot of developers opt for one or more of the available third party offerings. These include:
- Capesoft's XFiles (easy conversion to and from Clarion data structures, with some limitations)
- ThinkData's xmlFuse, a COM wrapper around Microsoft's XML parser
- Robert Paresi's iQ-XML, free and highly regarded but source code is not available
(If I've missed any useful Clarion XML third party products please post a comment below.)
What we sometimes forget is that Clarion itself ships with some extensive XML parsing capabilities. Years ago TopSpeed licensed the CenterPoint XML C++ class library (which is itself, reportedly, built on the open source Expat XML parser) and incorporated a wrapper for that library into the Clarion product line.
DOM and SAX
The Clarion XML toolset includes both DOM and SAX parsers. A DOM parser reads an XML file and creates a corresponding Document Object Model (DOM) in memory. A SAX parser reads an XML file and fires off events as it encounters the different parts of the document (elements, attributes etc.). SAX parsers are generally viewed as better for really large documents, but as I don't deal with really large XML documents I've only ever used DOM parsers.
The Clarion code
Clarion ships with a bunch of code to support the use of the CenterPoint parsers. You can find this code in the following files:
- cpxml.inc
- cpxml.clw
- cpxmlif.inc
There are some additional source files that add support for translating XML to and from Clarion structures, and I'll get to those at some future date.
I'm really interested in this core code, however, because it promises me the kind of low level access to XML documents I want and need.
But there's a problem.
Making sense out of the cpxml* files
The cpxml* files are nearly unreadable, for the simple reason that they contain too much code. So I began by looking for specific clues.
Here's what I get when I search all three files for classes (using Notepad++):
First of all I'd like to know what a UI component (based on the ABC WindowManager class) is doing in a core XML parser library, but I suppose I can let that slip for now. VLB stands for Virtual List Box so it's probably something to do with displaying XML data, perhaps for debugging purposes.
But what are the other classes for? Apart from SAXParserClass none of them seems to have anything obviously to do with XML.
Searching for interfaces, however, tells a different story:
I'm not sure (yet) why the interfaces are duplicated between cpxml.inc and cpxmlif.inc, but clearly this is where the good stuff is. To begin with I'll focus my attention on cpxml.inc.
The Clarion help has this definition of an interface:
An INTERFACE is a structure, which contains the methods (PROCEDUREs) that define the behavior to be implemented by a CLASS. It cannot contain any property declarations. All methods defined within the INTERFACE are implicitly virtual.
I'll assume for the purposes of this discussion that you already have some understanding of interfaces.
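For readers who would like a quick refresher anyway, here is a minimal sketch of a conventional Clarion interface and a class that implements it. All of the names (IGreeter, Greeter, SayHello) are hypothetical and have nothing to do with cpxml.inc; the point is only to show what an interface normally looks like before examining how the XML library departs from that pattern.

```clarion
  PROGRAM

  MAP
  END

! Hypothetical names throughout; none of this comes from cpxml.inc
IGreeter    INTERFACE
SayHello      PROCEDURE(STRING who)
            END

! The class promises to supply every method the interface declares
Greeter     CLASS,IMPLEMENTS(IGreeter)
            END

g           &IGreeter                  ! reference to the interface, not the class

  CODE
  g &= Greeter.IGreeter                ! obtain the interface from the object
  g.SayHello('Clarion')                ! call through the interface reference

! Interface methods are implemented as ClassName.InterfaceName.MethodName
Greeter.IGreeter.SayHello PROCEDURE(STRING who)
  CODE
  MESSAGE('Hello, ' & who)
```

Note that the calling code only ever touches the interface reference, which is the same way the XML library's DOM objects are used later in this article.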
Examining the interfaces
If you look at the DOMWriter interface in the source editor you'll see this:
This looks just like a standard Clarion interface. And when I started working with the XML library I was able to use it just like a standard Clarion interface, but I noticed something odd: I never got any code completion on the interface methods.
As it turns out there's some jiggery-pokery going on here. Scroll all the way to the right in the source editor and you'll see the rest of the story:
Get rid of the extra spaces, replace the line continuation characters with line breaks, and format the code according to the Clarion standard and you'll end up with this:
DOMWriter           interface
                    END

    map
        module('DOMWriter')
            setEncoding     PROCEDURE(*DOMWriter, const *CSTRING encoding), name('_setEncoding@8'), pascal, raw
            setEncoding     PROCEDURE(*DOMWriter, const *CSTRING encoding, bool assumeISO88591), name('_setEncoding2@12'), pascal, raw
            getEncoding     PROCEDURE(*DOMWriter), *CSTRING, name('_getEncoding@4'), pascal, raw
            getLastEncoding PROCEDURE(*DOMWriter), *CSTRING, name('_getLastEncoding@4'), pascal, raw
            setFormat       PROCEDURE(*DOMWriter, UNSIGNED format), name('_setFormat@8'), pascal, raw
            getFormat       PROCEDURE(*DOMWriter), UNSIGNED, name('_getFormat@4'), pascal, raw
            setNewLine      PROCEDURE(*DOMWriter, const *CSTRING newLine), name('_setNewLine@8'), pascal, raw
            getNewLine      PROCEDURE(*DOMWriter), *CSTRING, name('_getNewLine@4'), pascal, raw
            writeNode       PROCEDURE(*DOMWriter, *Node pNode), *CSTRING, name('_writeNode@8'), pascal, raw
            writeNode       PROCEDURE(*DOMWriter, const *CSTRING systemId, *Node pNode),cbool, name('_writeNode2@12'), pascal, raw
        end
    end
It turns out the interface itself is an empty structure, which is why I never got any code completion. Instead of interface methods there are procedure declarations which take the interface as the first parameter.
As you may already know the first parameter of any class's (or interface's) method is the class (or interface) itself, which is why these functions behave in exactly the same way as if they were interface methods. Each function is a call into the SoftVelocity wrapper around the CenterPoint C++ XML library.
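The implicit first parameter can be seen with an ordinary class. The following is a minimal sketch using a hypothetical Counter class (nothing here comes from cpxml.inc). It relies on my reading of the Clarion language reference, which describes calling a method either with dot syntax or by passing the object explicitly as the first parameter; treat the second call form as an assumption to verify against your compiler version.

```clarion
  PROGRAM

  MAP
  END

! Hypothetical class for illustration only
Counter     CLASS
Total         LONG
Increase      PROCEDURE(LONG amount)
            END

  CODE
  Counter.Increase(5)                  ! usual dot syntax; Counter is passed as SELF
  Increase(Counter, 5)                 ! assumed equivalent: object passed explicitly
  MESSAGE('Total: ' & Counter.Total)

! Usual definition form. The language reference also describes naming the
! class as an explicit first parameter labeled SELF instead of prefixing it.
Counter.Increase PROCEDURE(LONG amount)
  CODE
  SELF.Total += amount
```

The cpxml declarations apply the same idea in reverse: the procedures are declared outside the interface with the interface as the first parameter, and the compiler still accepts the dot syntax when calling them.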
The problem
This technique of marrying Clarion interfaces to the C++ library is tricky and most interesting, but it raises a concern, at least in my mind. In other languages I'm used to modeling XML documents with a set of classes, but here I don't have any classes at all, just interfaces. How do I work with the XML data?
Luckily, I've been down this path before. Close to a decade ago I wrote an article in Clarion Magazine on how to create an XML document using the then-C6 DOM parser. Here's the code, which was pretty much a straight port of some of the C++ example code:
DOMRss              PROCEDURE
DOMImpl             &DomImplementation
docType             &DocumentType
pDoc                &Document
pText               &Text
pRootElem           &Element
pElement            &Element
pCommentElem        &Comment
pChannelElem        &Element
pItemElement        &Element
writer              &DomWriter
pCData              &CDATASection
encoding            cstring('UTF-8')
s                   cciCStringFactory
    CODE
    DomImpl &= CreateDomImplementation()
    DocType &= null
    pDoc &= DomImpl.CreateDocument(s.c(''),s.c('rss'),DocType)
    pRootElem &= pDoc.GetDocumentElement()
    pRootElem.SetAttribute(s.c('version'),s.c('0.91'))
    pCommentElem &= pDoc.CreateComment(s.c('ClarionMag DOM RSS example'))
    pDoc.InsertBefore(pCommentElem,pRootElem)
    pCommentElem.Release()
    pChannelElem &= pDoc.CreateElement(s.c('channel'))

    pElement &= pDoc.CreateElement(s.c('title'))
    pText &= pDoc.CreateTextNode(s.c('Clarion News'))
    pElement.appendChild(pText)
    pText.Release()
    pChannelElem.AppendChild(pElement)
    pElement.Release

    pElement &= pDoc.CreateElement(s.c('description'))
    pText &= pDoc.CreateTextNode(s.c('News, product announcements, ' |
        & 'and other items of interest to Clarion developers'))
    pElement.appendChild(pText)
    pText.Release()
    pChannelElem.AppendChild(pElement)
    pElement.Release

    pElement &= pDoc.CreateElement(s.c('language'))
    pText &= pDoc.CreateTextNode(s.c('en-us'))
    pElement.appendChild(pText)
    pText.Release()
    pChannelElem.AppendChild(pElement)
    pElement.Release

    pElement &= pDoc.CreateElement(s.c('link'))
    pText &= pDoc.CreateTextNode(s.c('http://www.clarionmag.com'))
    pElement.appendChild(pText)
    pText.Release()
    pChannelElem.AppendChild(pElement)
    pElement.Release

    pElement &= pDoc.CreateElement(s.c('copyright'))
    pText &= pDoc.CreateTextNode(s.c('Copyright 1999-2003 by CoveComm Inc.'))
    pElement.appendChild(pText)
    pText.Release()
    pChannelElem.AppendChild(pElement)
    pElement.Release

    ! Add first <item>
    pItemElement &= pDoc.CreateElement(s.c('item'))

    pElement &= pDoc.CreateElement(s.c('title'))
    pText &= pDoc.CreateTextNode(s.c('RPM Email Survey'))
    pElement.appendChild(pText)
    pText.Release()
    pItemElement.AppendChild(pElement)
    pElement.Release

    pElement &= pDoc.CreateElement(s.c('link'))
    pText &= pDoc.CreateTextNode(s.c('http://www.cwaddons.com/email/'))
    pElement.appendChild(pText)
    pText.Release()
    pItemElement.AppendChild(pElement)
    pElement.Release

    pElement &= pDoc.CreateElement(s.c('description'))
    pText &= pDoc.CreateTextNode(s.c('Lee White is looking for ' |
        & 'feedback, from current and prospective RPM users, ' |
        & 'about email support in RPM.'))
    pElement.appendChild(pText)
    pText.Release()
    pItemElement.AppendChild(pElement)
    pElement.Release

    pChannelElem.AppendChild(pItemElement)
    pItemElement.Release()

    ! Add second <item>
    pItemElement &= pDoc.CreateElement(s.c('item'))

    pElement &= pDoc.CreateElement(s.c('title'))
    pText &= pDoc.CreateTextNode(s.c('True Edit-In-Place Template'))
    pElement.appendChild(pText)
    pText.Release()
    pItemElement.AppendChild(pElement)
    pElement.Release

    pElement &= pDoc.CreateElement(s.c('link'))
    pText &= pDoc.CreateTextNode(s.c('http://www.audkus.dk'))
    pElement.appendChild(pText)
    pText.Release()
    pItemElement.AppendChild(pElement)
    pElement.Release

    pElement &= pDoc.CreateElement(s.c('description'))
    pText &= pDoc.CreateTextNode(s.c('This new EIP Template ' |
        & 'adds full template support for the Clarion ' |
        & 'edit-in-place list box. ABC templates only; ' |
        & 'includes source and future updates.'))
    pElement.appendChild(pText)
    pText.Release()
    pCData &= pDoc.CreateCDATASection(s.c('This is some text in a CDATA section'))
    pElement.AppendChild(pCData)
    PCData.Release()
    pItemElement.AppendChild(pElement)
    pElement.Release

    pChannelElem.AppendChild(pItemElement)
    pItemElement.Release()

    pRootElem.AppendChild(pChannelElem)
    pChannelElem.Release()

    Writer &= CreateDOMWriter()
    Writer.setFormat(format:reformatted)
    if Writer.writeNode(s.c('domrss.xml'),pDoc).
    pDoc.Release()
    DestroyDomImplementation(DomImpl)
There are a couple of issues with this code. One is that all of my interface references (pText, pElement etc.) have a really ugly "p" prefix. I definitely can't call them Text, Element etc. because those are the actual names of the interfaces and I'll get symbol collisions. Giving the interfaces an appropriate prefix (see below) will solve that problem.
The bigger issue, however, is that I'm limited in what I can do with the XML document, as it only exists in memory allocated by the C++ library. What if I wanted to validate an XML node, or automatically populate it with some attributes when it was created? I don't have those kinds of options.
What I really want is a set of Clarion (not C++) classes that I can use to model an XML document. That way I can add whatever code I need anywhere I need it. Of course I'll still use the existing library to read and write the XML documents by using the existing interfaces and API calls.
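Short of a full class model, the repeated create/append/release sequence in the DOMRss listing shows where Clarion-side code could hook in. The following is a minimal sketch, not the classes this series goes on to build: AddTextElement is a hypothetical helper name, and the sketch assumes the Document, Element and Text interfaces and the cciCStringFactory class are declared and behave exactly as they are used in the listing above (including s.c() accepting a STRING parameter).

```clarion
! Hypothetical helper, not part of cpxml.inc. Its prototype would go in the MAP:
!   AddTextElement PROCEDURE(*Document pDoc, *Element pParent, STRING tagName, STRING textValue)
!
! Creates <tagName>textValue</tagName>, appends it to pParent and releases
! the temporary references, mirroring the sequence used in DOMRss.
AddTextElement      PROCEDURE(*Document pDoc, *Element pParent, STRING tagName, STRING textValue)
pElement            &Element
pText               &Text
s                   cciCStringFactory      ! assumed to work as in the DOMRss listing
    CODE
    pElement &= pDoc.CreateElement(s.c(tagName))
    pText &= pDoc.CreateTextNode(s.c(textValue))
    ! Any validation or default attributes could be applied here
    pElement.AppendChild(pText)
    pText.Release()
    pParent.AppendChild(pElement)
    pElement.Release()
```

With a helper like this, each six-line block in DOMRss would shrink to a single call such as AddTextElement(pDoc, pChannelElem, 'title', 'Clarion News'). It still leaves the document living entirely inside the C++ library, which is the limitation a real class model is meant to address.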
Gimme dem classes!
The first thing I do when I start any refactoring project is hit Ctrl-A and Ctrl-I to reformat the code to my liking. It's also something I do regularly while I code, such as after moving or changing a block of code in a way that alters indenting.
The next step is to come up with a set of actual classes (TYPEd, of course) that model the already-defined interfaces (which themselves are a pretty straightforward model of the DOM specification).
The more classes I write, the more adamant I become about two rules:
- Each class must be in its own .INC/.CLW file pair, and
- Each class (and interface) needs a descriptive, prefixed name
There will always be exceptions, but following these two rules will make any class library much easier to understand.
I contacted Bob Zaunere and asked for and received permission to refactor these classes, and they'll eventually be part of SV's future community repository on GitHub. As much as possible I try to follow the example of .NET when it comes to class naming, but since Clarion doesn't support namespaces I use text to mimic .NET naming. And because these are core SV classes, instead of my usual standard I've used the System_XML_ prefix. It's a little presumptuous I suppose, but as I seem to be the first one in the pool....
I renamed selected interfaces to classes as follows:
Old name | New name |
---|---|
Attr | System_XML_Attr |
CDATASection | System_XML_CDATASection |
CharacterData | System_XML_CharacterData |
Comment | System_XML_Comment |
Document | System_XML_Document |
Element | System_XML_Element |
Node | System_XML_Node |
NodeList | System_XML_NodeList |
Notation | System_XML_Notation |
Text | System_XML_Text |
I settled on some new class names as well, but for the most part they don't figure into the first round of refactoring. The only one that's radically different is CStringClass, which is a fairly small class used to manufacture CStrings which are needed when communicating with the CenterPoint XML library.
Old name | New name |
---|---|
CStringClass | System_XML_CString |
String creation is a capability needed in other places than just XML so originally I wanted to give it a more general name, but to avoid introducing any outside dependencies for now I've simply called it System_XML_CString. That will probably change in the near future.
I have a pet peeve about using "Class" in the name of a class. I suppose this harkens back to naming conventions like Hungarian notation, where it was helpful and necessary to be able to infer the data type from the variable name. There's no need for this from a code safety point of view since classes are strongly typed, and the compiler will complain if you try to use a class as something other than a class. For readability there's code completion. So adding "Class" to a class name is arguably just noise. And in fact within the ABC library the term "Class" is added to class names inconsistently.
Creating the class template
I almost never write a class from scratch anymore. Instead I use John Hickey's excellent ClarionLive Class Creator and create my classes from standard class templates. Those aren't templates in the Clarion sense; they're class files that contain the basic structure of the class I want to create.
When I create a class I'm almost always imagining that at some time in the future (when the code is stable) it will be compiled into, and exported from, a DLL. And that means I need some compiler directives for the LINK and DLL attributes. You've probably seen these before in the form of the _ABCDLLMode_ and _ABCLinkMode_ symbols. And if you've ever had to set these manually, for whatever reason, you probably know the disastrous consequences (read: GPF) of getting them wrong.
Here's my class header template file:
    include('System_XML_IncludeInAllClassHeaderFiles.inc'),once

System_XML_BaseClass    Class,Type,Module('System_XML_BaseClass.CLW')|
                            ,Link('System_XML_BaseClass.CLW',_System_XML_Classes_LinkMode_)|
                            ,Dll(_System_XML_Classes_DllMode_)
Construct                   Procedure()
Destruct                    Procedure()
                        End

Looks painful to type, right? That's why it's in a template file, so I don't have to keep typing it. But the class template only references the link mode and DLL mode symbols; they still have to be defined in the project somehow, and I don't want to have to type them there either. That's something the application (*.tp?) templates usually do for us, but in this case I don't yet have a template, I just have a bunch of source code. So instead I create a standard header file that sets these symbols. Here's System_XML_IncludeInAllClassHeaderFiles.inc:
!----------------------------------------------------------------------------
! While in the development phase default to classes being compiled. Once
! the code is stable and a DLL is provided the following three lines can
! be removed.
!----------------------------------------------------------------------------
    omit('***',_Compile_System_XML_Class_Source_)
_Compile_System_XML_Class_Source_   equate(1)
    ***

    OMIT('***',_Compile_System_XML_Class_Source_)
_System_XML_Classes_LinkMode_       equate(0)
_System_XML_Classes_DllMode_        equate(1)
    ***

    COMPILE('***',_Compile_System_XML_Class_Source_)
_System_XML_Classes_LinkMode_       equate(1)
_System_XML_Classes_DllMode_        equate(0)
    ***
Now instead of two symbols I only need to worry about one symbol: _Compile_System_XML_Class_Source_
And I don't even have to worry about that one, since I've added a bit of code at the top to default _Compile_System_XML_Class_Source_ to 1 so the classes will always be compiled. At some future point when the classes are stable and I'm providing a DLL I can remove the first Omit statement and using the classes from a DLL will be the default (which can be overridden by setting _Compile_System_XML_Class_Source_ to true in the project).
Clear as mud? Good.
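If it isn't quite clear, the mechanism can be reduced to a single symbol in a stand-alone program. This is a minimal sketch with hypothetical names (_UseDll_ and BuildMode are not part of the XML classes). It relies on the same behavior as the header file: a symbol that has not been defined evaluates as zero in an OMIT or COMPILE expression.

```clarion
  PROGRAM

  MAP
  END

! _UseDll_ is a hypothetical symbol. Left undefined it evaluates as zero,
! so the OMIT block below is compiled and the COMPILE block is skipped.
  OMIT('***',_UseDll_)
BuildMode   EQUATE('compiled from source')
  ***
  COMPILE('***',_UseDll_)
BuildMode   EQUATE('imported from a DLL')
  ***

  CODE
  MESSAGE('Classes are ' & BuildMode)
```

Defining the symbol in the project's conditional compilation symbols (something like _UseDll_=>1, the same form used for _ABCDllMode_ and _ABCLinkMode_) should flip which of the two blocks the compiler sees, without touching the source.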
Here's the class template for the .CLW file:
    Member
    Map
    End

    Include('System_XML_BaseClass.inc'),Once
    !include('System_Logger.inc'),once

!dbg                                System_Logger

System_XML_BaseClass.Construct      Procedure()
    code

System_XML_BaseClass.Destruct       Procedure()
    code
There's no real code here yet, but I always put a constructor and destructor in the template because most of the time I end up needing one or both. And they serve as an example of how methods are implemented.
When I create a class using the Class Creator it will replace System_XML_BaseClass with whatever class name I specify.
The XML classes
I'll go into the actual classes and some unit tests next time, but here's the list of source files as it stands now. There are some I haven't yet implemented, and I've also added a few new classes which I'll explain in a future article:
System_XML_Attr.clw
System_XML_Attr.inc
System_XML_AttrList.clw
System_XML_AttrList.inc
System_XML_CDataSection.clw
System_XML_CDataSection.inc
System_XML_CenterPointInterfaces.inc
System_XML_CharacterData.clw
System_XML_CharacterData.inc
System_XML_Comment.clw
System_XML_Comment.inc
System_XML_CString.clw
System_XML_CString.inc
System_XML_Document.clw
System_XML_Document.inc
System_XML_Element.clw
System_XML_Element.inc
System_XML_IncludeInAllClassHeaderFiles.inc
System_XML_Node.clw
System_XML_Node.inc
System_XML_NodeBuffer.clw
System_XML_NodeBuffer.inc
System_XML_NodeList.clw
System_XML_NodeList.inc
System_XML_Notation.clw
System_XML_Notation.inc
System_XML_Text.clw
System_XML_Text.inc
System_XML_XPath.inc
System_XML_XPathQuery.clw
System_XML_XPathQuery.inc
Next time: some unit tests and an explanation of how to read and write basic XML.
And for those of you who just can't wait, here's the source.