Reconstructing Clarion's XML tooling, Part 1

If you want to do XML with Clarion you have a few options. A lot of developers opt for one or more of the available third party offerings. These include:

(If I've missed any useful Clarion XML third party products please post a comment below.)

What we sometimes forget is that Clarion itself ships with some extensive XML parsing capabilities. Years ago TopSpeed licensed the CenterPoint XML C++ class library (which is itself, reportedly, built on the open source Expat XML parser) and incorporated a wrapper for that library into the Clarion product line.

DOM and SAX

The Clarion XML toolset includes both DOM and SAX parsers. A DOM parser reads an XML file and creates a corresponding Document Object Model (DOM) in memory. A SAX parser reads an XML file and fires off events as it encounters the different parts of the document (elements, attributes etc.). SAX parsers are generally viewed as better for really large documents, but as I don't deal with really large XML documents I've only ever used DOM parsers. 

The Clarion code

Clarion ships with a bunch of code to support the use of the CenterPoint parsers. You can find this code in the following files:

  • cpxml.inc
  • cpxml.clw
  • cpxmlif.inc

There are some additional source files that add support for translating XML to and from Clarion structures, and I'll get to those at some future date. 

I'm really interested in this core code, however, because it promises me the kind of low level access to XML documents I want and need. 

But there's a problem. 

Making sense out of the cpxml* files

The cpxml* files are nearly unreadable, for the simple reason that they contain too much code. So I began by looking for specific clues. 

Here's what I get when I search all three files for classes (using Notepad++):

First of all I'd like to know what a UI component (based on the ABC WindowManager class) is doing in a core XML parser library, but I suppose I can let that slip for now. VLB stands for Virtual List Box so it's probably something to do with displaying XML data, perhaps for debugging purposes. 

But what are the other classes for? Apart from SAXParserClass none of them seems to have anything obviously to do with XML. 

Searching for interfaces, however, tells a different story:

I'm not sure (yet) why the interfaces are duplicated between cpxml.inc and cpxmlif.inc, but clearly this is where the good stuff is. To begin with I'll focus my attention on cpxml.inc. 

The Clarion help has this definition of an interface:

An INTERFACE is a structure, which contains the methods (PROCEDUREs) that define the behavior to be implemented by a CLASS. It cannot contain any property declarations. All methods defined within the INTERFACE are implicitly virtual.

I'll assume for the purposes of this discussion that you already have some understanding of interfaces.
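If you'd like a quick refresher anyway, here is a minimal sketch of an ordinary Clarion interface and a class that implements it. The names IGreeter, Greeter, Greet and personName are made up for illustration and have nothing to do with the cpxml* files.

```clarion
  PROGRAM

  MAP
  END

! Hypothetical names throughout; none of this comes from cpxml.inc.
IGreeter            INTERFACE
Greet                 PROCEDURE(STRING personName),STRING
                    END

! A class promises to supply the interface's methods with IMPLEMENTS
Greeter             CLASS,IMPLEMENTS(IGreeter)
                    END

greeterRef          &IGreeter

  CODE
  greeterRef &= Greeter.IGreeter        ! obtain the interface reference from the class instance
  MESSAGE(greeterRef.Greet('Clarion'))

! The implementation is labelled Class.Interface.Method
Greeter.IGreeter.Greet PROCEDURE(STRING personName)
  CODE
  RETURN 'Hello, ' & CLIP(personName)
```

The calling code only ever sees the IGreeter reference, which is the same position I'm in with the XML library: I get interface references back and never see the object behind them.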

Examining the interfaces

If you look at the DOMWriter interface in the source editor you'll see this:

This looks just like a standard Clarion interface. And when I started working with the XML library I was able to use it just like a standard Clarion interface, but I noticed something odd: I never got any code completion on the interface methods. 

As it turns out there's some jiggery-pokery going on here. Scroll all the way to the right in the source editor and you'll see the rest of the story:

Get rid of the extra spaces, replace the line continuation characters with line breaks, and format the code according to the Clarion standard and you'll end up with this:

DOMWriter           interface
                    END
                    map
                        module('DOMWriter')
setEncoding                 PROCEDURE(*DOMWriter, const *CSTRING encoding), name('_setEncoding@8'), pascal, raw
setEncoding                 PROCEDURE(*DOMWriter, const *CSTRING encoding, bool assumeISO88591), name('_setEncoding2@12'), pascal, raw
getEncoding                 PROCEDURE(*DOMWriter), *CSTRING, name('_getEncoding@4'), pascal, raw
getLastEncoding             PROCEDURE(*DOMWriter), *CSTRING, name('_getLastEncoding@4'), pascal, raw
setFormat                   PROCEDURE(*DOMWriter, UNSIGNED format), name('_setFormat@8'), pascal, raw
getFormat                   PROCEDURE(*DOMWriter), UNSIGNED, name('_getFormat@4'), pascal, raw
setNewLine                  PROCEDURE(*DOMWriter, const *CSTRING newLine), name('_setNewLine@8'), pascal, raw
getNewLine                  PROCEDURE(*DOMWriter), *CSTRING, name('_getNewLine@4'), pascal, raw
writeNode                   PROCEDURE(*DOMWriter, *Node pNode), *CSTRING, name('_writeNode@8'), pascal, raw
writeNode                   PROCEDURE(*DOMWriter, const *CSTRING systemId, *Node pNode),cbool, name('_writeNode2@12'), pascal, raw
                        end
                    end

It turns out the interface itself is an empty structure, which is why I never got any code completion. Instead of interface methods there are procedure declarations which take the interface as the first parameter. 

As you may already know the first parameter of any class's (or interface's) method is the class (or interface) itself, which is why these functions behave in exactly the same way as if they were interface methods. Each function is a call into the SoftVelocity wrapper around the CenterPoint C++ XML library. 

The problem

This technique of marrying Clarion interfaces to the C++ library is tricky and most interesting, but it raises a concern, at least in my mind. In other languages I'm used to modeling XML documents with a set of classes, but here I don't have any classes at all, just interfaces. How do I work with the XML data? 

Luckily, I've been down this path before. Close to a decade ago I wrote an article in Clarion Magazine on how to create an XML document using the then-C6 DOM parser. Here's the code, which was pretty much a straight port of some of the C++ example code:

DOMRss               PROCEDURE

DOMImpl      &DomImplementation
docType      &DocumentType
pDoc         &Document
pText        &Text
pRootElem    &Element
pElement     &Element
pCommentElem &Comment
pChannelElem &Element
pItemElement &Element
writer       &DomWriter
pCData       &CDATASection
encoding     cstring('UTF-8')
s            cciCStringFactory
  CODE
  DomImpl &= CreateDomImplementation()
  DocType &= null
  pDoc &= DomImpl.CreateDocument(s.c(''),s.c('rss'),DocType)

  pRootElem &= pDoc.GetDocumentElement()
  pRootElem.SetAttribute(s.c('version'),s.c('0.91'))
  pCommentElem &= pDoc.CreateComment(s.c('ClarionMag DOM RSS example'))
  pDoc.InsertBefore(pCommentElem,pRootElem)
  pCommentElem.Release()

  pChannelElem &= pDoc.CreateElement(s.c('channel'))

  pElement &= pDoc.CreateElement(s.c('title'))
  pText &= pDoc.CreateTextNode(s.c('Clarion News'))
  pElement.appendChild(pText)
  pText.Release()
  pChannelElem.AppendChild(pElement)
  pElement.Release

  pElement &= pDoc.CreateElement(s.c('description'))
  pText &= pDoc.CreateTextNode(s.c('News, product announcements, ' |
    & 'and other items of interest to Clarion developers'))
  pElement.appendChild(pText)
  pText.Release()
  pChannelElem.AppendChild(pElement)
  pElement.Release

  pElement &= pDoc.CreateElement(s.c('language'))
  pText &= pDoc.CreateTextNode(s.c('en-us'))
  pElement.appendChild(pText)
  pText.Release()
  pChannelElem.AppendChild(pElement)
  pElement.Release

  pElement &= pDoc.CreateElement(s.c('link'))
  pText &= pDoc.CreateTextNode(s.c('http://www.clarionmag.com'))
  pElement.appendChild(pText)
  pText.Release()
  pChannelElem.AppendChild(pElement)
  pElement.Release

  pElement &= pDoc.CreateElement(s.c('copyright'))
  pText &= pDoc.CreateTextNode(s.c('Copyright 1999-2003 by CoveComm Inc.'))
  pElement.appendChild(pText)
  pText.Release()
  pChannelElem.AppendChild(pElement)
  pElement.Release

  ! Add first <item>

  pItemElement &= pDoc.CreateElement(s.c('item'))

  pElement &= pDoc.CreateElement(s.c('title'))
  pText &= pDoc.CreateTextNode(s.c('RPM Email Survey'))
  pElement.appendChild(pText)
  pText.Release()
  pItemElement.AppendChild(pElement)
  pElement.Release

  pElement &= pDoc.CreateElement(s.c('link'))
  pText &= pDoc.CreateTextNode(s.c('http://www.cwaddons.com/email/'))
  pElement.appendChild(pText)
  pText.Release()
  pItemElement.AppendChild(pElement)
  pElement.Release

  pElement &= pDoc.CreateElement(s.c('description'))
  pText &= pDoc.CreateTextNode(s.c('Lee White is looking for ' |
    & 'feedback, from current and prospective RPM users, ' |
    & 'about email support in RPM.'))
  pElement.appendChild(pText)
  pText.Release()
  pItemElement.AppendChild(pElement)
  pElement.Release

  pChannelElem.AppendChild(pItemElement)
  pItemElement.Release()

  ! Add second <item>

  pItemElement &= pDoc.CreateElement(s.c('item'))
  pElement &= pDoc.CreateElement(s.c('title'))
  pText &= pDoc.CreateTextNode(s.c('True Edit-In-Place Template'))
  pElement.appendChild(pText)
  pText.Release()
  pItemElement.AppendChild(pElement)
  pElement.Release

  pElement &= pDoc.CreateElement(s.c('link'))
  pText &= pDoc.CreateTextNode(s.c('http://www.audkus.dk'))
  pElement.appendChild(pText)
  pText.Release()
  pItemElement.AppendChild(pElement)
  pElement.Release

  pElement &= pDoc.CreateElement(s.c('description'))
  pText &= pDoc.CreateTextNode(s.c('This new  EIP Template ' |
    & 'adds full template support for the Clarion ' |
    & 'edit-in-place list box. ABC templates only; ' |
    & 'includes source and future updates.'))
  pElement.appendChild(pText)
  pText.Release()

  pCData &= pDoc.CreateCDATASection(s.c('This is some text in a CDATA section'))
  pElement.AppendChild(pCData)
  PCData.Release()

  pItemElement.AppendChild(pElement)
  pElement.Release

  pChannelElem.AppendChild(pItemElement)
  pItemElement.Release()

  pRootElem.AppendChild(pChannelElem)
  pChannelElem.Release()

  Writer &= CreateDOMWriter()
  Writer.setFormat(format:reformatted)
  if Writer.writeNode(s.c('domrss.xml'),pDoc).
  pDoc.Release()

  DestroyDomImplementation(DomImpl)

There are a couple of issues with this code. One is that all of my interface references (pText, pElement etc.) have a really ugly "p" prefix. I definitely can't call them Text, Element etc. because those are the actual names of the interfaces and I'll get symbol collisions. Giving the interfaces an appropriate prefix (see below) will solve that problem. 

The bigger issue, however, is that I'm limited in what I can do with the XML document, as it only exists in memory allocated by the C++ library. What if I wanted to validate an XML node, or automatically populate it with some attributes when it was created? I don't have those kinds of options.

What I really want is a set of Clarion (not C++) classes that I can use to model an XML document. That way I can add whatever code I need anywhere I need it. Of course I'll still use the existing library to read and write the XML documents by using the existing interfaces and API calls. 
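To make that idea concrete, here is a rough sketch of the kind of thing a Clarion-side class allows. It is not one of the classes from this series: XmlElementSketch, TagName, SetTagName and IsValid are hypothetical names, the validation rule is deliberately simplistic, and the sketch doesn't touch the CenterPoint library at all.

```clarion
  PROGRAM

  MAP
  END

! Hypothetical sketch; not one of the System_XML_ classes from this article.
XmlElementSketch    CLASS,TYPE
TagName               CSTRING(65)
SetTagName            PROCEDURE(STRING newName)
IsValid               PROCEDURE(),BOOL
                    END

channel             XmlElementSketch    ! an instance of the TYPEd class

  CODE
  channel.SetTagName('channel')
  IF channel.IsValid()
    MESSAGE('Element name accepted: ' & channel.TagName)
  ELSE
    MESSAGE('Element name rejected')
  END

XmlElementSketch.SetTagName PROCEDURE(STRING newName)
  CODE
  SELF.TagName = CLIP(newName)

XmlElementSketch.IsValid PROCEDURE()
  CODE
  ! Simplistic rule for illustration: the name must be non-empty and contain no spaces
  IF LEN(SELF.TagName) = 0 OR INSTRING(' ', SELF.TagName, 1, 1)
    RETURN FALSE
  END
  RETURN TRUE
```

Because the element lives in a Clarion class, a check like IsValid can sit right next to the data, which is exactly what I can't do when the document exists only inside the C++ library.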

Gimme dem classes!

The first thing I do when I start any refactoring project is hit Ctrl-A and Ctrl-I to reformat the code to my liking. It's also something I do regularly while I code, such as after moving or changing a block of code in a way that alters indenting. 

The next thing was to come up with a set of actual classes (TYPEd, of course) that model the already-defined interfaces (which themselves are a pretty straightforward model of the DOM specification). 

The more classes I write, the more adamant I become about two rules:

  1. Each class must be in its own .INC/.CLW file pair, and
  2. Each class (and interface) needs a descriptive, prefixed name

There will always be exceptions, but following these two rules will make any class library much easier to understand. 

I contacted Bob Zaunere and asked for and received permission to refactor these classes, and they'll eventually be part of SV's future community repository on GitHub. As much as possible I try to follow the example of .NET when it comes to class naming, but since Clarion doesn't support namespaces I use text to mimic .NET naming. And because these are core SV classes, instead of my usual standard I've used the System_XML_ prefix. It's a little presumptuous I suppose, but as I seem to be the first one in the pool....

I renamed selected interfaces to classes as follows:

Old name        New name
Attr            System_XML_Attr
CDATASection    System_XML_CDATASection
CharacterData   System_XML_CharacterData
Comment         System_XML_Comment
Document        System_XML_Document
Element         System_XML_Element
Node            System_XML_Node
NodeList        System_XML_NodeList
Notation        System_XML_Notation
Text            System_XML_Text

I settled on some new class names as well, but for the most part they don't figure into the first round of refactoring. The only one that's radically different is CStringClass, which is a fairly small class used to manufacture CStrings which are needed when communicating with the CenterPoint XML library. 

Old name        New name
CStringClass    System_XML_CString

String creation is a capability needed in other places than just XML so originally I wanted to give it a more general name, but to avoid introducing any outside dependencies for now I've simply called it System_XML_CString. That will probably change in the near future. 

I have a pet peeve about using "Class" in the name of a class. I suppose this harkens back to naming conventions like Hungarian notation, where it was helpful and necessary to be able to infer the data type from the variable name. There's no need for this from a code safety point of view since classes are strongly typed, and the compiler will complain if you try to use a class as something other than a class. For readability there's code completion. So adding "Class" to a class name is arguably just noise. And in fact within the ABC library the term "Class" is added to class names inconsistently. 

Creating the class template

I almost never write a class from scratch anymore. Instead I use John Hickey's excellent ClarionLive Class Creator and create my classes from standard class templates. Those aren't templates in the Clarion sense; they're class files that contain the basic structure of the class I want to create. 

When I create a class I'm almost always imagining that at some time in the future (when the code is stable) it will be compiled into, and exported from, a DLL. And that means I need some compiler directives for the LINK and DLL attributes. You've probably seen these before in the form of the _ABCDLLMode_ and _ABCLinkMode_ symbols. And if you've ever had to set these manually, for whatever reason, you probably know the disastrous consequences (read: GPF) of getting them wrong. 

Here's my class header template file:

    include('System_XML_IncludeInAllClassHeaderFiles.inc'),once

System_XML_BaseClass                        Class,Type,Module('System_XML_BaseClass.CLW')|
                                                  ,Link('System_XML_BaseClass.CLW',_System_XML_Classes_LinkMode_)|
                                                  ,Dll(_System_XML_Classes_DllMode_)
Construct                                       Procedure()
Destruct                                        Procedure()
                                            End

Looks painful to type, right? That's why it's in a template file, so I don't have to keep typing it. But the header refers to two symbols, _System_XML_Classes_LinkMode_ and _System_XML_Classes_DllMode_, and those still have to be defined for the project somehow. I don't want to have to type them there either. That's something the application (*.tp?) templates usually do for us, but in this case I don't yet have a template, I just have a bunch of source code. So instead I create a standard header file that sets these symbols. Here's System_XML_IncludeInAllClassHeaderFiles.inc:

!----------------------------------------------------------------------------
! While in the development phase default to classes being compiled. Once
! the code is stable and a DLL is provided the following three lines can 
! be removed.
!----------------------------------------------------------------------------
    omit('***',_Compile_System_XML_Class_Source_)
_Compile_System_XML_Class_Source_           equate(1)
    ***

    OMIT('***',_Compile_System_XML_Class_Source_)
_System_XML_Classes_LinkMode_           equate(0)
_System_XML_Classes_DllMode_            equate(1)
    ***

    COMPILE('***',_Compile_System_XML_Class_Source_)     
_System_XML_Classes_LinkMode_           equate(1)
_System_XML_Classes_DllMode_            equate(0)
    ***

Now instead of two symbols I only need to worry about one symbol: _Compile_System_XML_Class_Source_

And I don't even have to worry about that one, since I've added a bit of code at the top to default _Compile_System_XML_Class_Source_ to 1 so the classes will always be compiled. At some future point when the classes are stable and I'm providing a DLL I can remove the first Omit statement and using the classes from a DLL will be the default (which can be overridden by setting _Compile_System_XML_Class_Source_ to true in the project). 
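If you want to see which way the symbols ended up, a throwaway program along these lines should do it. This is just a sketch that assumes the header file above is on the include path; it uses nothing beyond the two equates the header defines.

```clarion
  PROGRAM

  MAP
  END

  ! Pull in the header so the two mode equates are defined
  INCLUDE('System_XML_IncludeInAllClassHeaderFiles.inc'),ONCE

  CODE
  ! Display the values the OMIT/COMPILE blocks settled on
  MESSAGE('LinkMode=' & _System_XML_Classes_LinkMode_ & |
          ', DllMode=' & _System_XML_Classes_DllMode_)
```

With the header as shown, the first OMIT block defines _Compile_System_XML_Class_Source_ as 1, so the COMPILE block is the one that takes effect and the classes are linked in from source.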

Clear as mud? Good. 

Here's the class template for the .CLW file:

                                            Member
                                            Map
                                            End
                    
    Include('System_XML_BaseClass.inc'),Once
    !include('System_Logger.inc'),once
!dbg                                     System_Logger
System_XML_BaseClass.Construct                     Procedure()
    code
    
System_XML_BaseClass.Destruct                      Procedure()
    code  

There's no real code here yet, but I always put a constructor and destructor in the template because most of the time I end up needing one or both. And they serve as an example of how methods are implemented. 

When I create a class using the Class Creator it will replace System_XML_BaseClass with whatever class name I specify. 
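As an illustration of that substitution, this is roughly what the header template would look like for a class named System_XML_Node. I've simply replaced the name by hand here; the exact spacing the Class Creator produces may differ.

```clarion
    include('System_XML_IncludeInAllClassHeaderFiles.inc'),once

! Same structure as the template, with System_XML_BaseClass replaced by System_XML_Node
System_XML_Node                             Class,Type,Module('System_XML_Node.CLW')|
                                                  ,Link('System_XML_Node.CLW',_System_XML_Classes_LinkMode_)|
                                                  ,Dll(_System_XML_Classes_DllMode_)
Construct                                       Procedure()
Destruct                                        Procedure()
                                            End
```

The same replacement happens in the .CLW template, so the include statement and the method labels there pick up the new class name as well.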

The XML classes

I'll go into the actual classes and some unit tests next time, but here's the list of source files as it stands now. There are some I haven't yet implemented, and I've also added a few new classes which I'll explain in a future article:

System_XML_Attr.clw
System_XML_Attr.inc
System_XML_AttrList.clw
System_XML_AttrList.inc
System_XML_CDataSection.clw
System_XML_CDataSection.inc
System_XML_CenterPointInterfaces.inc
System_XML_CharacterData.clw
System_XML_CharacterData.inc
System_XML_Comment.clw
System_XML_Comment.inc
System_XML_CString.clw
System_XML_CString.inc
System_XML_Document.clw
System_XML_Document.inc
System_XML_Element.clw
System_XML_Element.inc
System_XML_IncludeInAllClassHeaderFiles.inc
System_XML_Node.clw
System_XML_Node.inc
System_XML_NodeBuffer.clw
System_XML_NodeBuffer.inc
System_XML_NodeList.clw
System_XML_NodeList.inc
System_XML_Notation.clw
System_XML_Notation.inc
System_XML_Text.clw
System_XML_Text.inc
System_XML_XPath.inc
System_XML_XPathQuery.clw
System_XML_XPathQuery.inc

Next time: some unit tests and an explanation of how to read and write basic XML. 

And for those of you who just can't wait, here's the source.