RFC[02] - nZyme C++ code segment

George Makrydakis gmakmail at gmail.com
Tue Mar 21 13:59:01 PST 2006


Hi guys, sorry for top-posting. I know you are all busy with jhalfs; this is an interim update towards a more functional version of the C++ parsing codebase.
This is a more recent version of the project, on which the conversion to OOP will be based. It is becoming evident that the next stage of development will 
have two separate "structures" working together: the tokenizer and the element parser. So far, effort has been concentrated on finding the least amount of code 
necessary using the C++ STL string, without coding a "classical" FSM.
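
To make the tokenizer / element parser split concrete, here is a minimal sketch of the two stages as separate functions. It is not the nZyme code: the names splitTokens, classifyToken and TokenKind are hypothetical, and it assumes comments, DOCTYPE handling and error control are dealt with elsewhere.

```cpp
#include <string>
#include <vector>

// Hypothetical stage 1 (tokenizer): split raw text into "<...>" tokens and
// the character data between them, using only std::string searches.
std::vector<std::string> splitTokens(const std::string& text)
{
    std::vector<std::string> tokens;
    std::string::size_type pos = 0;
    while (pos < text.size())
    {
        std::string::size_type open = text.find('<', pos);
        if (open == std::string::npos)
        {
            tokens.push_back(text.substr(pos)); // trailing character data
            break;
        }
        if (open > pos)
            tokens.push_back(text.substr(pos, open - pos)); // data before the tag
        std::string::size_type close = text.find('>', open);
        if (close == std::string::npos)
        {
            tokens.push_back(text.substr(open)); // unterminated tag, kept as-is in this sketch
            break;
        }
        tokens.push_back(text.substr(open, close - open + 1));
        pos = close + 1;
    }
    return tokens;
}

// Hypothetical stage 2 (element parser): classify one token from stage 1.
enum TokenKind { TOKEN_TEXT, TOKEN_OPENING, TOKEN_CLOSING, TOKEN_EMPTY };

TokenKind classifyToken(const std::string& token)
{
    if (token.empty() || token[0] != '<')
        return TOKEN_TEXT;
    if (token.compare(0, 2, "</") == 0)
        return TOKEN_CLOSING;
    if (token.size() >= 2 && token.compare(token.size() - 2, 2, "/>") == 0)
        return TOKEN_EMPTY;
    return TOKEN_OPENING;
}
```

Calling splitTokens on a whole document and then classifyToken on each result keeps the two concerns apart, which is the shape the OOP conversion is expected to take.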

The current version will raise a "fatality" if an element and/or attribute definition is incorrect. It does not check whether the element name is valid (for now; this 
is just one more statement to add, which I deliberately left out). The most important thing about this version is that it has error control for spurious < and > 
characters within the document, and that it is strict when parsing element names and attributes (whitespace and double quotes are respected; otherwise it raises an 
error saying the document is incorrect). This is from the March 17th branch; the current branch includes the entity 
data structures and the rest, and needs fixing and more testing before I post it for evaluation. I think I should keep you up to date with where this is heading, which is why 
I am posting this.
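
As an illustration of what "strict" attribute parsing means here, the following is a minimal standalone sketch that accepts only sequences of the form name = "value". It is not taken from the program below: parseAttributes and AttributeList are hypothetical names, and the set of characters rejected inside a name is an assumption, not a full XML name check.

```cpp
#include <string>
#include <utility>
#include <vector>

typedef std::vector<std::pair<std::string, std::string> > AttributeList;

// Hypothetical helper: parse `raw` (the text following the element name) into
// name/value pairs. Returns false as soon as the input deviates from
// name = "value", with optional whitespace around the '=' sign.
bool parseAttributes(std::string raw, AttributeList& out)
{
    const char* const ws = " \t\n\r\v";
    for (;;)
    {
        raw.erase(0, raw.find_first_not_of(ws)); // drop leading whitespace
        if (raw.empty())
            return true; // everything consumed without errors

        std::string::size_type eq = raw.find('=');
        if (eq == std::string::npos)
            return false; // no '=' sign at all

        std::string name = raw.substr(0, eq);
        name.erase(name.find_last_not_of(ws) + 1); // trim trailing whitespace
        if (name.empty() || name.find_first_of(" \t\n\r\v/'\"();<>") != std::string::npos)
            return false; // assumed set of characters not allowed in a name

        std::string::size_type open = raw.find_first_not_of(ws, eq + 1);
        if (open == std::string::npos || raw[open] != '"')
            return false; // the value must start with a double quote

        std::string::size_type close = raw.find('"', open + 1);
        if (close == std::string::npos)
            return false; // unterminated value

        out.push_back(std::make_pair(name, raw.substr(open + 1, close - open - 1)));
        raw.erase(0, close + 1);
    }
}
```

With this sketch, id = "ch01" is accepted, while id=ch01 (no quotes) and my id="1" (whitespace inside the name) are rejected, which mirrors the "fatality" cases described above.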

It is possible to trim everything down even more, so with the addition of the entity data structures and the XPointer features it should be able to provide 
one of the most compact solutions around.

background info:
> http://linuxfromscratch.org/pipermail/alfs-discuss/2006-March/007760.html

Note: CDATA sections are not supported for the time being (they are easy to add; the LFS book simply does not have them).
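
For reference, a minimal sketch of how a CDATA section could be located with the same std::string search style. This is an assumption about one possible approach, not part of the posted code; extractCDATA and its parameters are hypothetical.

```cpp
#include <string>

// Hypothetical helper: if a CDATA section starts at or after `from`, copy its
// literal content into `content`, set `next` to the index just past "]]>",
// and return true. Returns false if no complete section is found.
bool extractCDATA(const std::string& text, std::string::size_type from,
                  std::string& content, std::string::size_type& next)
{
    const std::string opener = "<![CDATA[";
    const std::string closer = "]]>";

    std::string::size_type start = text.find(opener, from);
    if (start == std::string::npos)
        return false; // no CDATA section present
    start += opener.size();

    std::string::size_type stop = text.find(closer, start);
    if (stop == std::string::npos)
        return false; // unterminated section

    content = text.substr(start, stop - start); // taken verbatim, '<' and '>' included
    next = stop + closer.size();
    return true;
}
```

The content is returned untouched, so any < or > inside it would have to be kept away from the spurious-bracket checks.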

Thank you. Comments are welcome.

George Makrydakis

---------------------------------------------------------




	// nZyme "parsing" project
	// author: George Makrydakis gmakmail a/t gmail d0t com
	// license: BSD
	// release: revision A5 POC code base
	// scheduling: conversion to OOP class / template structures.
	// labdate:	March 17th 2006

	// WATCH THE LINES WHILE CUT/PASTE
	
	#include <iostream>
	#include <fstream>
	#include <string>
	#include <vector>

	using namespace std;

	int main (int argc, char **argv)
	{

		// despite some inherent inefficiencies of the C++ STL string template, we will be using it for parsing XML
		// keeping in mind that we can actually create a more efficient "string" than the STL itself.

		// version: revision A5 - error control implemented, any incorrect syntax should trigger fatal events.
		// TODO: complete validation and incorporate <! full support with the necessary data structure for entity dereferencing.
	
	vector<string> xmlVECT; // vector containing separate raw line segments from the XML file
	vector<string> xmlITEM; // vector containing formatted string tokens as out of the tokenizer

	string xmlBUFF;	// holds a buffered string
	string dtdROOT;	// holds the root element name
	string xmlLINE;	// holds a single member of the xmlVECT vector
	string dtdBUFF;	// holds a buffered string while processing DTD
	string xmlTOKN; // holds a formatted XML string token out of the xmlLINE string
	string xmlCOMM; // holds an unformatted XML string free of comments when those are met, it is a buffer actually.
	string tryme;
	string tempobuffer;

	int lnct;		// holds a line counter variable
	int cTAG;		// within a given string, index of a usable '>' (tag-closing) character
	int oTAG;		// within a given string, index of a usable '<' (tag-opening) character
	int sTAG;		// within a given string, index to a usable whitespace or non whitespace sequence
	int lnct2;		// holds a line counter variable

	if (argc != 2)
	{
		cerr << "Usage: " << argv[0] << " [XML FILE]" << endl; // note the difference; not always working but 90% of the time, getting close to 100%!
		return(-1);
	}
	ifstream xmlFILE(argv[1]);

	if ( xmlFILE.is_open() )
	{
		while (getline(xmlFILE,xmlBUFF,'\n'))
		{
			xmlVECT.push_back(xmlBUFF.erase(0,xmlBUFF.find_first_not_of(" \t\n\r\v")));
		}
		xmlFILE.close();
		xmlBUFF.clear();
	}
	else
	{
		cout << "file not found!" << endl;
		return -1;
	}
	// two portions within one program justify some planning ahead: tokenizing != parsing, so you get the idea of how
	// to create the OOP structure correctly from the POC code
	for (lnct = 0; lnct < xmlVECT.size(); lnct++)
	{
		xmlLINE = xmlVECT.at(lnct);
		
		if (xmlLINE.find("<!DOCTYPE") != string::npos)
		{
			xmlLINE = xmlLINE.substr(xmlLINE.find("<!DOCTYPE") + 9);
			xmlLINE = xmlLINE.erase(0, xmlLINE.find_first_not_of(" \t\n\r\v"));
			lnct2 = lnct;
			while (dtdROOT.empty())
			{
				sTAG = xmlLINE.find_first_not_of(" \t\n\r\v");
				if (sTAG != string::npos)
				{
					dtdBUFF = xmlLINE.substr(sTAG, xmlLINE.find_first_of(" \t\n\r\v"));
					dtdBUFF = dtdBUFF.erase(0, dtdBUFF.find_first_not_of(" \t\n\r\v"));
					xmlLINE = xmlLINE.substr(sTAG + dtdBUFF.size(), xmlLINE.find_first_of(" \t\n\r\v"));
					if ((dtdBUFF.find_first_of("</[]\\'\"&;>:") == string::npos))
					{
						if (!(( dtdBUFF == "PUBLIC") || ( dtdBUFF == "SYSTEM")))
						{
							while (xmlLINE.find("<" + dtdBUFF) == string::npos)
							{
								xmlLINE = xmlLINE + "  " + xmlVECT.at(lnct2);
								if ( xmlLINE.find("<!--") != string::npos )
								{
									xmlCOMM = xmlLINE.substr(0, xmlLINE.find("<!--"));
									while (xmlLINE.find("-->") == string::npos)
									{
										lnct2++;
										xmlLINE = xmlVECT.at(lnct2);
									}
									xmlLINE = xmlLINE.substr(xmlLINE.find("-->") + 3);
									xmlLINE= xmlCOMM + xmlLINE;
									xmlCOMM.clear();
								}
								lnct2++;
							}
							lnct = lnct2;
							//xmlITEM.push_back("<!DOCTYPE " + xmlLINE.substr(0, xmlLINE.find("<" + dtdBUFF)));
							xmlLINE = xmlLINE.substr(xmlLINE.find("<" + dtdBUFF)) + xmlVECT.at(lnct);
							dtdROOT = dtdBUFF;
						}
						else if (( dtdBUFF == "PUBLIC") || ( dtdBUFF == "SYSTEM") || (dtdBUFF.find_first_of("</[]\\'\"&;>:") == string::npos))
						{
							cout << "FATALITY: root element not declared within DOCTYPE statement!" << endl;
							return 1;
						}
					}
				}
				lnct2++;
				if (dtdROOT.empty()) {xmlLINE = xmlVECT.at(lnct2);}
			}
		}
			while (!xmlLINE.empty())
			{
				if (!xmlCOMM.empty()) {xmlLINE = xmlCOMM + xmlLINE; xmlCOMM.clear();}
				if (!xmlBUFF.empty()) {xmlLINE = xmlBUFF + " " + xmlLINE; xmlBUFF.clear();}
				if ( xmlLINE.find("<!--") != string::npos )
				{
					xmlCOMM = xmlLINE.substr(0, xmlLINE.find("<!--"));
					while (xmlLINE.find("-->") == string::npos)
					{
						lnct++;
						xmlLINE = xmlVECT.at(lnct);
					}
					xmlLINE = xmlLINE.substr(xmlLINE.find("-->") + 3);
					xmlCOMM = xmlCOMM + xmlLINE;
					xmlLINE.clear();
				}
				
				cTAG = xmlLINE.find(">");
				oTAG = xmlLINE.find("<");
				if ((oTAG == string::npos) || (cTAG == string::npos))
				{
					if ((oTAG == string::npos) && (cTAG == string::npos) && !xmlLINE.empty())
					{
						xmlITEM.push_back(xmlLINE);
						xmlLINE.clear();
						break;
					}
					else if ((oTAG != string::npos) && (cTAG == string::npos))
					{
						xmlBUFF = xmlLINE.substr(oTAG);
						xmlITEM.push_back(xmlLINE.substr(0, oTAG));
						xmlLINE.clear();
						break;						
					}
					else if ((oTAG == string::npos) && (cTAG != string::npos))
					{
						cout << "FATALITY: a spurious > sign has been found!" << endl;
						return (-1);
					}
				}
				else
				{	
					if ((cTAG - oTAG) > 0)
					{
							xmlTOKN = xmlLINE.substr(0, oTAG);
							if (!xmlTOKN.empty()){xmlITEM.push_back(xmlTOKN);}
							xmlTOKN = xmlLINE.substr(oTAG, cTAG + 1 - oTAG);
							if (!(xmlTOKN.find("<") == xmlTOKN.find_last_of("<")))
							{
								cout << "FATALITY: A spurious < sign has been found!" << endl;
								return(-1);
							}
							if (!xmlTOKN.empty()){xmlITEM.push_back(xmlTOKN);}
							xmlLINE = xmlLINE.substr(cTAG + 1);
					}
					else
					{
						cout << "FATALITY: a spurious > sign has been found!" << endl;
						return (-1);
					}
				}
			}
	}
	// this section will be a separate structure...
	for (lnct = 0; lnct < xmlITEM.size(); lnct++)
	{
			xmlLINE = xmlITEM.at(lnct);
			if (xmlLINE.find("<") != string::npos)
			{
				// ok we have a semantically important structure, we now need to classify it
				//
				// 1. element closure
				// 2. element without attributes
				// 3. element with attributes
				// 4. element EMPTY, no attributes
				// 5. element EMPTY, with attributes

				if (xmlLINE.find(" ") == string::npos) // if no whitespaces are found, then either open / close / empty, no attributes
				{
					if (xmlLINE.find("</") != string::npos)
					{
						tryme = xmlLINE.substr(2, xmlLINE.size() - 3);
						cout << "CLOSING ELEMENT:" + tryme << endl;

					}
					else if (xmlLINE.find("/>") != string::npos)
					{
						tryme = xmlLINE.substr(1, xmlLINE.size() - 3);
						cout << "EMPTY ELEMENT:" + tryme << endl;
					}
					else
					{
						tryme = xmlLINE.substr(1, xmlLINE.size() - 2);
						cout << "OPENING ELEMENT:" + tryme << endl;
					}
				}
				else // whitespaces are found, so we have attributes contained!
				{
					// the element name runs from index 1 up to, but not including, the first space
					if (xmlLINE.find("/>") != string::npos)
					{
						tryme = xmlLINE.substr(1, xmlLINE.find(" ") - 1);
						cout << "OPENING EMPTY ELEMENT WITH ATTRIBUTES:" + tryme << endl;
						xmlLINE = xmlLINE.substr(0, xmlLINE.find("/>"));
					}
					else
					{
						tryme = xmlLINE.substr(1, xmlLINE.find(" ") - 1);
						cout << "OPENING ELEMENT WITH ATTRIBUTES:" + tryme << endl;
						xmlLINE = xmlLINE.substr(0, xmlLINE.find(">"));
					}
					string rawseq;
					rawseq = xmlLINE.substr(xmlLINE.find(tryme) + tryme.size());
					int startQUOTE;
					int stopsQUOTE;
					string attributeNAME;
					string attributeVALUE;
					while (!rawseq.empty())
					{
						rawseq = rawseq.erase(0, rawseq.find_first_not_of(" \t\n\r\v"));
						startQUOTE = rawseq.find("\"");
						if (startQUOTE != string::npos)
						{
							stopsQUOTE = rawseq.find("\"", startQUOTE + 1);
							if (stopsQUOTE != string::npos)
							{
								attributeNAME = rawseq.substr(0, startQUOTE);
								// lets "validate" the name shall we...
								// find the = character
								int attrpos = attributeNAME.find("=");
								if ((attrpos != string::npos) && (attrpos == attributeNAME.find_last_of("=")))
								{
									attributeNAME = attributeNAME.substr(0, attrpos);
									// now do a preventive whitespace trim...

									attributeNAME = attributeNAME.erase(attributeNAME.find_last_not_of(" \t") + 1, attributeNAME.find_last_of(" \t"));
									if (attributeNAME.find_first_of(" \t/\'();") != string::npos)
									{
										cout << "FATALITY: irregularities met during element parsing" << endl;
										return(-1);
									}
									tempobuffer = rawseq.substr(attrpos + 1, startQUOTE - attrpos - 1 );
									if (tempobuffer.find_first_not_of(" \t") != string::npos)
									{
										cout << "FATALITY: irregularities met during element parsing" << endl;
										return(-1);
									}
									attributeVALUE = rawseq.substr(startQUOTE + 1, stopsQUOTE - startQUOTE - 1);
									rawseq = rawseq.substr(stopsQUOTE + 1);
									cout << "\tNAME:" + attributeNAME + "\t VALUE:" + attributeVALUE << endl;
								}
								else
								{
									cout << "FATALITY: irregularities met during element parsing" << endl;
									return(-1);
								}
							}
							else
							{
									cout << "FATALITY: irregularities met during element parsing" << endl;
									return(-1);
							}
						}
						else
						{
									cout << "FATALITY: irregularities met during element parsing" << endl;
									return(-1);
						}
						if (rawseq == "?") rawseq.clear(); // <? ?> command - stuff has not been corrected yet, in process.
						// the above part of the code is put simply to avoid the subtlety for now...
						//loop ends
					}
					
					
				}
			}
	}

	xmlVECT.clear();
	xmlITEM.clear();
	return 0;
	}
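
The spurious < and > checks in the tokenizer loop above can also be expressed as a standalone pass. The following is a minimal sketch of that idea, not a replacement for the posted logic: bracketsAreBalanced is a hypothetical name, and it assumes comments have already been stripped. It scans character by character, which is closer to a small state machine than the find-based style used above. Unlike the program above, it also treats a tag left open at the end of the input as an error.

```cpp
#include <string>

// Hypothetical helper: returns true when every '<' is closed by a '>' before
// the next '<' appears, and no '>' appears outside a tag.
bool bracketsAreBalanced(const std::string& text)
{
    bool insideTag = false;
    for (std::string::size_type i = 0; i < text.size(); ++i)
    {
        if (text[i] == '<')
        {
            if (insideTag)
                return false; // spurious '<' inside an open tag
            insideTag = true;
        }
        else if (text[i] == '>')
        {
            if (!insideTag)
                return false; // spurious '>' outside any tag
            insideTag = false;
        }
    }
    return !insideTag; // a tag left open at the end is also an error here
}
```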

	



