RFC[01] - nZyme C++ code segment

George Makrydakis gmakmail at gmail.com
Mon Mar 13 14:54:14 PST 2006


Greetings to everybody in the *LFS projects.

PART 1: Sum of the previous bash work - for new readers of these posts.
PART 2: Current Status and Code Segments in C++ - for the seasoned reader.


PART 1: Historical Review
----------------------------

As most of you know, I have been working on a fully self  - hosted alfs solution based on C++ after choosing it as the development language. The following post 
is one concerning one of the most essential parts of the project, namely a fast, small and easy to debug / develop XML "parsing" class. Initially, the work was 
done for providing a fully self hosted bash based solution to integrate into the jhalfs project; after talking to the jhalfs team I decided to move to a regular 
  programming language that would prove more adequate than a script - based solution. Thus this is not a portion for a bash based alfs solution but a C++ one.

For reviewing purposes, please consider reading the following messages:

x2sh revised:
Thu Feb 23 20:17:30 MST 2006
> http://linuxfromscratch.org/pipermail/alfs-discuss/2006-February/007641.html

The above link describes a pure bash based method for extracting the significant data out of the entire book in about 1 (one) min using nothing but bash built 
in commands for string manipulation (no third party binaries). This is a branch currently developed for personal use in other settings.

x2sh old "library - like" bash source
Mon Feb 20 22:14:33 MST 2006
> http://linuxfromscratch.org/pipermail/alfs-discuss/2006-February/007633.html

The above link contains a character by character implementation of an XML parser (FSM - finite state machine like) as posted in the mailing list, _different_ 
from the one above. Very slow but functional (the revised version "parses" the book 60x times faster though...), consider it an abandoned prototype, posted 
mainly for archiving purposes. Check the mailing list and the irc logs of the alfs-dev channel for discussions with official developers of the jhalfs / nalfs 
project.

PART 2: Current Status and Code Segments in C++
-----------------------------------------------

When I started building this (yet another) XML "parser", the main goal was to have as few lines of code as possible, making it easier for me and others to 
maintain / modify it. This parser tries to avoid as much as possible relying on a classical FSM prototype (Finite State Machine) that uses character by 
character recognition of the semantically important strings. The FSM is to be used in a later setting for another use within the parsing module, but not as the 
main engine for extracting significant XML syntax patterns or error controlling them. This is more of an approach and not a dogmatic view of things. Whatever is 
to be built for this project is to be done using the standard C++ STL classes and the like for the sake of having it fully self hosted and as small as possible, 
and as portable as possible.

Let us analyze the structure of a typical XML document, within which we find the following patterns when declaring semantically significant data of unequivocal 
presence whenever and wherever met:

	01.)	<!ATTLIST ...>
	02.)	<!ELEMENT ...>
	03.)	<!ENTITY ...>
	04.)	<!NOTATION ...>
	05.)	<?name ... ?>
	06.)	<name ... />
	07.)	<!-- character permissible but the '--' string to occur -->
	08.)	<![CDATA[text]]>
	09.)	<![IGNORE[declarations]]>
	10.)	<![INCLUDE[declarations]]>
	11.)	<!DOCTYPE ...[declarations]>
	12.)	<name ... >
	13.)	</name>
	14.)	pure text with no special characters whatsoever

The following observations must be made:

OBV-01.)

Alphanumerical strings of non XML reserved characters that are delimited by a single (<,>) sequence bear significant syntactic meaning, while strings delimited 
by a single (>,<) sequence do not. Single (<,<), single (>,>) when encountered outside string tokens extracted from a DTD stream, are considered to be part of 
an invalid / spurious XML file.

OBV-02.)

Whitespace character sequences ( \t\r\v\n) have different separator meaning when within a pair(GL) or within a pair(LG) sequence. The first whitespace sequence 
within a pair (GL) identifies an element, a preprocessing command, a declaration and so on starting immediately after the occurrence of a '<' character.

OBV-03.)

The <!DOCTYPE ... [declarations]...> where the last ... can be a whitespace character sequence (depends on the editor of the file, it is not a violation to do 
so) is the only syntactically significant structure where pair(GG) and pair (LL) can be found without raising a fatal exception. Note the following excerpt from 
the LFS XML source:

...
> <!DOCTYPE sect1 PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN" "http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd" [
> <!ENTITY % general-entities SYSTEM "../general.ent">
> %general-entities;
> ]>
...

This is completely equivalent to the following stream:
...
> <!DOCTYPE sect1 PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN" "http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd" [<!ENTITY % general-entities SYSTEM "../general.ent"> %general-entities;]>
...

Note now:

> <!DOCTYPE ...<!ENTITY...> ... > %general-entities;]>

Same applies for the less commonly used IGNORE and INCLUDE statements, which in any case fall within DOCTYPE and dealing with them is to be considered as 
dealing with the DOCTYPE declaration (DTD is not the !DOCTYPE but rather the dtd file...). For the purposes of this document we will be limiting ourselves to 
DOCTYPE, which is the hierarchically most important global container.

OBV-04.)

Of the 14 patterns that can be met in an XML source stream of characters, member 11 (DOCTYPE) and the combination of 12+13 are to be considered hierarchical 
patterns, with everything else being considered as a stand - alone type.


OBV-05.)

With the notable exception of the consequences of OBV-03, we can deduct that for a given valid XML stream there is a "binary" separation that takes place 
between character streams delimited by pair(LG) who require further parsing and those delimited by pair(GL) who do not. For a moment, we put aside the DOCTYPE 
issue. Such pairs are found mixed with various whitespace character sequences ( \t\r\v\n) in various combinations, but we always follow the pattern below:

[FRAGMENT01]: ...<...>...<...>...<...>...

For a given line of XML source, let's limit ourselves to possible scenarios with the first and the last delimiter characters (<),(>):

[PREFIX...](first of either < or >),...,(last of either < or >)[SUFFIX...]
--------------------------------------------------------------------------

case 1: [PREFIX...](<),...,(>)[SUFFIX...]
case 2: [PREFIX...](>),...,(>)[SUFFIX...]
case 3: [PREFIX...](<),...,(<)[SUFFIX...]
case 4: [PREFIX...](>),...,(<)[SUFFIX...]
case 5: [PREFIX...](NONE),...,(NONE)[SUFFIX...]

Provided FRAGMENT01 succession is met (no pair(GG) or pair (LL) are found), it is easy to convert the above to successive strings like the ones below:

case 1:  (<...>) -----> no prefix, no suffix, fully contains ONE of the aforementioned patterns fully self - contained, parsing required.
case 2: !(<...>) -----> plain character data that do not need to be further parsed, parameter substitution if necessary, but not parsing.


Dividing in this way the entire stream, you can opt whether to parse or not depending on whether there is delimitation or not, because you are working with big 
chunks of strings. Depending on how this is implemented you can also have a very slim session of code doing this, without the need of continuously monitoring 
state by observing the occurrence of specific reserved characters in significantly important sequences. This (5->2) conversion can be seen here, but please 
check the entire code segment as posted towards the end of this post (NOTE: error control has been deliberately omited but it is present and functional in the 
author's repository):

*SAMPLE C++ CODE EXCERPT*

...

while (!xmlLINE.empty())
{
	if (!xmlBUFF.empty()) { xmlLINE = xmlBUFF + " " + xmlLINE; }
	xmlBUFF.clear();
	cTAG = xmlLINE.find(">");
	oTAG = xmlLINE.find("<");
	if (( oTAG == string::npos ) || ( cTAG == string::npos ))
	{
		if (( oTAG == string::npos ) && ( cTAG == string::npos ))
		{
			xmlITEM.push_back(xmlLINE);
			xmlLINE.clear();
			break;
		}
		else if (( oTAG != string::npos ) && ( cTAG == string::npos ))
		{
			xmlBUFF = xmlLINE.substr(oTAG);
			xmlITEM.push_back(xmlLINE.substr(0, oTAG));
			xmlLINE.clear();
			break;						
		}
	}
	else
	{	
		xmlITEM.push_back(xmlLINE.substr(0, oTAG));
		xmlITEM.push_back(xmlLINE.substr(oTAG, cTAG + 1 - oTAG));
		xmlLINE = xmlLINE.substr(cTAG + 1);
	}
}

...

Notice that reserved character presence is not monitored by its presence, but rather by its absence, avoiding the need to monitor with even more elaborate if - 
like control structures or switch - case constructs.

Let's get back at the DOCTYPE issue:

> <!DOCTYPE ...<!ENTITY...> ... > %general-entities;]>

If we directly applied the above method here we would likely fail because of pair(GG) and pair (LL) presence. The problem can be solved if we wrap the entire 
DOCTYPE declaration hierarchy within a single <,> container. the ']>' terminator does not constitute a valid choice at times, for it may be present within 
DOCTYPE in such a way that any whitespace character sequence may separate the ] from the >. Although this is would be "bad practice" for some XML authors, it is 
not something that is unlikely to not be found. A [ or a ] character found within DOCTYPE declaration, is unlikely to be an unequivocal reference point unless a 
lot of other variables are considered, making it not the optimal choice for a "state" changing code block...

This is how this problem can be solved:

> <!DOCTYPE _sect1_ PUBLIC...[...<!ENTITY...> ... > %general-entities;]>... _<sect1_....

Obviously the first non - comment (<!--) (also non <?...) XML semantically significant structure once DTD terminates _HAS_ (not yelling, ever, even if other 
people may think otherwise, plain text data conveys no sound of unequivocal significance/tone/timber/chrome/emotional content :) ) to be the document _root_ 
element; One hierarchy ends (DOCTYPE) another one initiates (root element). There is a single root element and it has that particular position. Everything else 
found there other than it (or comment only, this is omitted for simplicity but there is a functional code segment in the author's depo) means that we have a non 
well - formed document (and possibly invalid/corrupt at times).

(NOTE: the very first [ and the very last ] within a DOCTYPE  declaration are referred to as Declaration Subset Open (DSO) and Declaration Subset Close (DSC)).


Using such concepts as above, the nZyme "tokenizer" for XML is presented, as POC in C++ code, simply cut the code block below. Please avoid using it when 
comments are nested within the DOCTYPE declaration or immediately after it, this is POC code; The version under development incorporates error controls, full 
comment support as well as inline parsing among other features. Care has been given to have as many useful comments as possible, for this a previous build 
without the entirety of current working / under process features has been posted. Comments both public and private are expected for evaluation of the community 
response. If possible, it is preferable to meet with alfs members in the IRC project channels previous appointment.

Thank you,

George Makrydakis

gmak
----

//-------------------------------------------------------------------------------------------------------
#include <iostream>
#include <fstream>
#include <string>
#include <vector>

using namespace std;


/*
  *	The nZyme (sub?) Project
  *
  *	A collection of POC code segments for an XML parser / data extractor / data manipulator
  *	originally created for the ALFS project. The following is a "tokenizer" part.
  *
  *	author	:	George Makrydakis
  *	info	:	gmakmail /at/ gmail >d0t< com
  *	license	:	Pending (OSI Compatible, either GPL or BSD .. in progress)
  *	revision:	A2 - POC branch
  *	date	:	March 11th 2006
  *
  *
  *
  */

	int main (int argc, char **argv)
	{
	
	vector<string> xmlVECT; // vector containing separate raw line segments from the XML file
	vector<string> xmlITEM; // vector containing formatted string tokens as out of the tokenizer

	string xmlBUFF;	// holds a buffered string
	string dtdROOT;	// holds the root element name
	string xmlLINE;	// holds a single member of the xmlVECT vector
	string dtdBUFF;	// holds a buffered string while processing DTD

	int lnct;	// holds a line counter variable
	int cTAG;	// within a given string, index to a usable within code segments '<' character
	int oTAG;	// within a given string, index to a usable within code segments '>' character
	int sTAG;	// within a given string, index to a usable whitespace or non whitespace sequence
	int lnct2;	// holds a line counter variable




	// ---> irrelevant part for the POC STARTS ----------------------------------------------------
	if (argc != 2)
	{
		printf("Usage: %s [VALID, WELL FORMED XML FILE]\n", argv[0]);
		return(-1);
  	}
	ifstream xmlFILE(argv[1]);

	/*	note type:	WARNING
	 *	note date:	Mar/10/2006
	 *	note info:	File loading in memory	
	 *
	 *	Loading the entire file in memory is a quick and dirty solution,
	 *	definetely not the optimal one, but fits its purpose in the POC
	 *	code segment for the parser.
	 *
	 */

	if ( xmlFILE.is_open() )
	{
		while (getline(xmlFILE,xmlBUFF,'\n'))
		{
			xmlVECT.push_back(xmlBUFF.erase(0,xmlBUFF.find_first_not_of(" \t\n\r\v")));
		}
		xmlFILE.close();
		xmlBUFF.clear();
	}
	else
	{
		cout << "file not found!" << endl;
		return -1;
	}

	// ---> irrelevant part for the POC FINISHES --------------------------------------------------


	/*	Technical Preamble & Disclaimer
	 *
	 *	This particular code block is offered as a monolithic procedure - oriented programming
	 *	POC code. The goal of the nZyme XML parsing (sub?)project is to have a token based XML
	 *	parser, since "single" character manipulation issues are forwarded to the C++ String class
	 *	of the STL C++ Library. In anytime the C++ String class can be substituted by one that
	 *	may prove more efficient, but still providing the same exact functionality. This is a fact
	 *	that is independent of the POC code presented in these code segments.
	 *
	 *	It is also rather obvious that a FSM model is to be applied around the principles laid out
	 *	here, especially once the proper OOP oriented aspect of this project is implemented. Having
	 *	a token - oriented FSM rather than a single char - oriented FSM can be less cumbersome to
	 *	both develop and debug.
	 *
	 *	The entire code segment is submitted to the ALFS project developer group as a RFC document.
	 *
	 */

	for (lnct = 0; lnct < xmlVECT.size(); lnct++)
	{
		xmlLINE = xmlVECT.at(lnct);

		/*	POC CODE SEGMENT: <!DOCTYPE "pair - matching" compatibility.
		 *
		 *	The following section is another POC segment, designed especially for the DTD
		 *	portion of the XML document currently under processing, the entire segment is
		 *	conditioned upon the presence of the <!DOCTYPE string token within the given
		 *	line. Tokens begining with <! are to be having their own sections as the OOP
		 *	implementation of the POC presented in these code segments is deployed. Easily
		 *	visible is the use of the dtdROOT string, which once defined can serve also as
		 *	a well - formedness cue for the validation part of this XML parser (Both DTD
		 *	and rest of the text).
		 *
		 */

		if (xmlLINE.find("<!DOCTYPE") != string::npos)
		{
			xmlLINE = xmlLINE.substr(xmlLINE.find("<!DOCTYPE") + 10);
			xmlLINE = xmlLINE.erase(0, xmlLINE.find_first_not_of(" \t\n\r\v"));
			lnct2 = lnct;
			while (dtdROOT.empty())
			{
				sTAG = xmlLINE.find_first_not_of(" \t\n\r\v");
				if (sTAG != string::npos)
				{
					dtdBUFF = xmlLINE.substr(sTAG, xmlLINE.find_first_of(" \t\n\r\v"));
					dtdBUFF = dtdBUFF.erase(0, dtdBUFF.find_first_not_of(" \t\n\r\v"));
					xmlLINE = xmlLINE.substr(sTAG + dtdBUFF.size(), xmlLINE.find_first_of(" \t\n\r\v"));
					if ((dtdBUFF.find_first_of("</[]&;>") == string::npos))
					{
						if (!(( dtdBUFF == "PUBLIC") || ( dtdBUFF == "SYSTEM")))
						{
							while (xmlLINE.find("<" + dtdBUFF) == string::npos)
							{
								/*
								 *	dtdBUFF contains the string that is relative to the dtdROOT element
								 *	simply proceed until the correcline containing the ("<" + dtdBUFF)
								 *	string is met. All lines are joined to one another until that happens.
								 */
								xmlLINE = xmlLINE + "  " + xmlVECT.at(lnct2);
								lnct2++;
							}
							lnct = lnct2;
							xmlITEM.push_back("<!DOCTYPE " + xmlLINE.substr(0, xmlLINE.find("<" + dtdBUFF)));
							xmlLINE = xmlLINE.substr(xmlLINE.find("<" + dtdBUFF)) + xmlVECT.at(lnct);
							dtdROOT = dtdBUFF;
							
						}
						else if (( dtdBUFF == "PUBLIC") || ( dtdBUFF == "SYSTEM"))
						{
							// Error control is elementary, we are dealing with valid + well formed XML for the
							// time being.
							cout << "fatal error! root element not declared within DOCTYPE statement" << endl;
							return 1;
						}
					}
				}
				lnct2++;
				if (dtdROOT.empty()) {xmlLINE = xmlVECT.at(lnct2);}
			}
		}

			/*	POC CODE SEGMENT: Tokenizing according to "pair - matching".
			 *
			 *	The main "tokenizer" code segment is constrained within the following single while() loop.
			 *	The nZyme XML parsing effort is relying upon elaboration of significant / non significant
			 *	string tokens. The following is the section that is solely responsible for providing them,
			 *	according to the pair(LG)/pair(GL) binary identity of the formatted string token to be
			 *	provided for inline parsing.
			 *
			 */

			while (!xmlLINE.empty())
			{
				if (!xmlBUFF.empty()) { xmlLINE = xmlBUFF + " " + xmlLINE; }
				xmlBUFF.clear();
				cTAG = xmlLINE.find(">");
				oTAG = xmlLINE.find("<");
				if (( oTAG == string::npos ) || ( cTAG == string::npos ))
				{
					if (( oTAG == string::npos ) && ( cTAG == string::npos ))
					{
						xmlITEM.push_back(xmlLINE);
						xmlLINE.clear();
						break;
					}
					else if (( oTAG != string::npos ) && ( cTAG == string::npos ))
					{
							xmlBUFF = xmlLINE.substr(oTAG);
							xmlITEM.push_back(xmlLINE.substr(0, oTAG));
							xmlLINE.clear();
							break;						
					}
				}
				else
				{	
					xmlITEM.push_back(xmlLINE.substr(0, oTAG));
					xmlITEM.push_back(xmlLINE.substr(oTAG, cTAG + 1 - oTAG));
					xmlLINE = xmlLINE.substr(cTAG + 1);
				}
			}
		}

	// for screen output, unimportant.

	for (lnct = 0; lnct < xmlITEM.size(); lnct++)
	{
			if (xmlITEM.at(lnct) != "" ) // note: to be gotten rid of.
			{
			cout << xmlITEM.at(lnct) << endl;
			}
	}
	xmlVECT.clear();
	xmlITEM.clear();

	return 0;
	}

//-------------------------------------------------------------------------------------------------------

	




More information about the alfs-discuss mailing list