x2sh boosting process - revised from the core and up!

George Makrydakis gmakmail at gmail.com
Thu Feb 23 10:17:30 PST 2006


Hello Everybody,

Some reviewing, and a _NEW_ post that comes closer to a solution of "our" "problem".

1. The x2sh library snapshot I posted is completely agnostic about what it parses, but it is slow when used on a big series of documents. I have tried this 
myself: parsing the entire LFS book and loading it into a semantically significant "table" simulation using bash arrays takes about 40 - 50 min. This is 
because the script code is complex and requires a lot of counter variables while parsing, but most of all, it reads everything in the xml source it parses 
character by character. This is completely unacceptable under _any_ circumstance. On the other hand, precisely because of that character-by-character nature, 
it would parse literally anything, however "weird" the syntax in the file may be.

2. The genXML boosting parser script is extremely fast: it takes only 30 - 40 s (a decrease in time by a factor of _60_ and up). It is able to seek within the 
complete set of files for any element of type <element> , </element> and dump its output to a _valid_ xml file with every <element> and its contained data. 
Unfortunately it does not support attributes, so '<screen role="no dump">' "patterns" can escape it. It requires bash 3.0, because it uses the "perlish" =~ 
operator first implemented there. Incorporating attribute awareness and inline parsing within genXML would eventually lead to increased complexity, especially 
when element attributes are laid out in a multi-line manner. This would mean more buffers, more variables, more lockups, and would eventually lead back to a 
situation like (1), making no practical use of the advantage the =~ operator offers.
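As a hypothetical sketch (the names and the sample line are mine, not genXML's), this is roughly what attribute awareness through the =~ operator could look like on a single line:

```shell
#!/bin/bash
# Hypothetical sketch, not part of genXML: capture an element name and its
# raw attribute text in one pass with the bash 3.x =~ operator.

line='<screen role="no dump">cat > /etc/fstab << "EOF"'

# Keeping the regex in a variable avoids quoting pitfalls inside [[ ]].
re='<([a-zA-Z0-9]+)([^>]*)>'

if [[ $line =~ $re ]]; then
    element="${BASH_REMATCH[1]}"     # element name:       screen
    attributes="${BASH_REMATCH[2]}"  # raw attribute text:  role="no dump"
fi

printf 'element: %s\nattributes:%s\n' "$element" "$attributes"
```

Even so, this only works when the whole tag sits on one line, which is exactly the multi-line complexity problem described above.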



******************************************

It appears obvious that, by design, the best approach would be to drop character-by-character parsing (1) and make a very simple script that reads the XML 
files but does not parse them; instead, for a _given_ input, it emits output that is either data contained within a _pair_ of < and > characters or data 
outside such a pair. Then, "parsing" can be done on that specific output using the =~ operator and the like, eliminating the need for more complex scripts. 
I present to you my newest implementation, going in that direction.

*****************************************
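A minimal sketch of that idea, assuming tags that open and close on the same line (the function name is mine; the full script below also handles tags spilling onto the next line):

```shell
#!/bin/bash
# Split an input string into "<...>" tokens and the plain data between
# them, using parameter expansion only - no per-character reads.

tokenize() {
    local rest=$1 tag
    while [ "${rest#*<}" != "$rest" ]; do
        # Data sitting before the next '<' is plain output.
        [ -n "${rest%%<*}" ] && printf '%s\n' "${rest%%<*}"
        rest="<${rest#*<}"      # drop everything before that '<'
        tag="${rest%%>*}>"      # keep up to the matching '>'
        printf '%s\n' "$tag"
        rest="${rest#*>}"       # continue past the emitted token
    done
    [ -z "$rest" ] || printf '%s\n' "$rest"
}

tokenize '<para>Install with <command>make install</command>.</para>'
```

Each output line is now either a complete tag or pure data, which is trivial to filter afterwards.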


Two versions of this new script are to be posted:

1. Everything is loaded into a conventional bash array where each entry contains data within a <,> pair or NOT. Nothing is printed on screen.
2. Everything is printed on screen while "parsed" on the fly; nothing is stored in arrays or anywhere else.

The reason for the two versions is simple: every printf builtin call takes some time. Adding an option to print the resulting array is useless, because the 
time it takes to display it exceeds 4 (four) min, while displaying on the fly takes nearly 1 min 20 s (values under normal operational load). The version 
printing on the fly is useful for testing (making sure that no lines are omitted during output, that counter variables are set right, etc.). So far I think I 
have worked out the possible pitfalls and bugs, but you never know. This is a better direction than before: the silent script parses and loads everything into 
the array in less than 50 s. I am also checking out the bash source files to see if there is any peculiar instruction or coding style that may be efficient 
for this kind of scripting.
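The printf cost can be seen with a rough micro-benchmark (mine, not from the measurements above): many per-line printf calls versus accumulating into one buffer and printing once.

```shell
#!/bin/bash
# Hypothetical comparison of output strategies; both produce identical text.

many_calls() {
    local i
    for ((i = 0; i < 5000; i++)); do
        printf '%s\n' "line $i"    # one builtin call per line
    done
}

one_call() {
    local i buf=""
    for ((i = 0; i < 5000; i++)); do
        buf="${buf}line $i"$'\n'   # bash 3.0-safe concatenation
    done
    printf '%s' "$buf"             # single builtin call at the end
}

time many_calls > /dev/null
time one_call   > /dev/null
```

The relative timings will vary by machine and bash version, but the per-call overhead is what makes the printing version slower than the silent one.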

Note that the way the case constructs are laid out makes bash treat them as && and || "commands", so we are relieved of the need for many 'if' structures 
coupled with && or || operators. This also makes it easier to understand how the script works. Elements and attributes are preserved in their entirety, ready 
for regexp-based filtering within the array.
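For instance (the array contents here are made up; only the array name is reused from the script), filtering out a '<screen role="no dump">' block becomes a single =~ scan over the loaded entries:

```shell
#!/bin/bash
# Hypothetical follow-up step: every entry is already either a complete
# "<...>" token or plain data, so attribute filtering is one regex test.

declare -a x2SHraw=( '<screen role="no dump">'
                     'cat version-check.sh'
                     '</screen>'
                     '<para>'
                     'Some text.'
                     '</para>' )

re='^<screen[^>]*role="no dump"'
skipped=""
for entry in "${x2SHraw[@]}"; do
    if [[ $entry =~ $re ]]; then
        skipped="$entry"
        printf 'would skip block opened by: %s\n' "$entry"
    fi
done
```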

Also, for those who want xpointer and related stuff _NOW_ (but the core must be worked out first!), try this for now: ./<scriptname>.sh | grep "xpointer" on 
the version of the script that gives printed output. Xinclude and related issues can be solved easily once the entirety of the book is parsed and loaded in a 
semantically and topographically meaningful manner in less than 2 (two) min. This is simply a demo, please remember that and bear with me. Check the 
attachment for various versions.

Take note that inline DTD elements and comments within the xml documents are treated as TRASH (of no importance). Check out the previous x2sh for entity 
dereferencing (it is very quick even in the character-by-character parsing edition, and more so with an approach like this one).


Average execution times (complete LFS 6.1.1 xml source)

1. silent version: ~ 50 s, with an almost even distribution between user and sys.
2. printing version: ~ 1 min 30 s, with the distribution in favour of user over sys.

Reducing the number of sources and filtering the input types fed to the script leads to near-proportional decreases in execution time.

All under normal operational load (web browsing, various editors and java programs running...). Having forced "parsing" of all the xml files of the book has 
made it easier for me to debug some issues regarding counter variables and string manipulation that can be of use in a more "uninformed" version of the 
algorithm as laid out in the script. Thank you for your patience and understanding. This script will run under both bash 2.x and 3.x versions.

MD5SUM is 6207d36085782fa45b3fb4f2115f8c67 *makeall.tar.bz2



Thank you for hosting my ideas on your mailing list. Waiting for your comments and bug reports.

George Makrydakis

gmak


#------------------------cut---------------------------------------------


#!/bin/bash

# x2sh booster - for the x2sh component to the jhalfs project
# author: George Makrydakis > gmakmail a|t gmail d0t c0m <
# license: GPL 2.0 or up
# revision: A1-print-nocomment
# instructions: run in the LFS book root

	declare -a x2SHraw
	declare -a x2SHchapters=(chapter01 \
				chapter02 \
				chapter03 \
				chapter04 \
				chapter05 \
				chapter06 \
				chapter07 \
				chapter08 \
				chapter09);
				
	declare -i x2SHindex=0
	declare -i lcnt=0

	declare  x2SHfile
	declare  originalsize

	declare otag
	declare ctag
	declare mpnt1
	declare mpnt2
	declare srcvar

	for x2SHpart in "${x2SHchapters[@]}"
	do
		cd "$x2SHpart"
	for x2SHfile in *.xml
	do
		x2SHraw=(); lcnt=0;
		# read -r keeps backslashes in the xml source intact
		while read -r x2SHraw[lcnt]
		do
			((lcnt++))
		done <"$x2SHfile"

	# first pass over the line: emit any text preceding the first '<',
	# or the whole line when it contains no markup at all
	for ((lcnt=0; lcnt < ${#x2SHraw[@]}; lcnt++));
	do
		case ${x2SHraw[lcnt]} in
			'')
			;;
			*)
				case ${x2SHraw[lcnt]} in
					*\<*)
						if [  "${x2SHraw[lcnt]%%<*}" != "" ] ; then
							printf "%s\n" "${x2SHraw[lcnt]%%<*}"
						fi
					;;
					*)
						if [ "${x2SHraw[lcnt]#>}" = "${x2SHraw[lcnt]}" ] ; then
							printf "%s\n" "${x2SHraw[lcnt]}"
						fi
					;;
				esac

			;;
		esac

		mpnt1="${x2SHraw[lcnt]}"
		mpnt2="${x2SHraw[lcnt]}"
		originalsize="${#x2SHraw[lcnt]}"

		# walk every <...> pair on the line: otag is the offset of the
		# current '<', ctag the length up to its matching '>'
		until [ "$mpnt1" = "${x2SHraw[lcnt]##*<}" ] && \
		      [ "$mpnt2" = "${x2SHraw[lcnt]##*>}" ] ;
		do
			mpnt1=${mpnt1#*<}; mpnt2=${mpnt2#*>}
			otag=$((originalsize - ${#mpnt1} - 1))
			ctag=$((originalsize - ${#mpnt2} - otag))
			if [ $ctag -ge 0 ] ; then
				printf "%s\n" "${x2SHraw[lcnt]:$otag:$ctag}"
				srcvar="$mpnt1"; srcvar="${srcvar#*>}"; srcvar="${srcvar%%<*}"
				case "$srcvar" in
					'')
					;;
					*)
						printf "%s\n" "$srcvar"
						srcvar=""
					;;
					
				esac
			elif [ $ctag -lt 0 ] ; then
				# tag continues on the next line: prepend the open
				# fragment to it and move on
				x2SHraw[$((lcnt + 1))]="<""${x2SHraw[lcnt]##*<}"" ${x2SHraw[$((lcnt + 1))]}"
				break
			fi
		done
	done
done
cd ..
done


#---------------------------------cut-------------------------------------------------------------------------



