Pull random line from a file

Kevan Shea biffcool at zombieworld.com
Fri Sep 6 09:37:03 PDT 2002


On Fri, 6 Sep 2002 17:05:40 +0100, Richard Lightman <richard at nezumi.plus.com> wrote:
> * Kevan Shea <biffcool at zombieworld.com> [2002-09-06 16:35]:
> > 
> > For completeness, here's the final script:
> > 
> > NUMLINES=`grep -c --regexp=.* $FILENAME`
> > NUM=`expr $RANDOM % $NUMLINES + 1`
> > RNDLINE=`sed -n ${NUM}p $FILENAME`
> > 
> Why read the whole file twice?
> 
> awk 'rand()*NR<=1 {l=$0} END {print l}' $FILENAME
> 
> Richard

Okay, I don't know awk at all, so this is all from skimming man gawk, plus an uncanny ability to completely misunderstand things.

rand() gives a number between 0 and 1.  NR is the number of lines read so far.  So for each line you get a number between 0 and the line number you're currently on, and you check whether it's less than or equal to 1.  If it is, you store the current line in l.  Then, after that has run on every line, the END block prints whatever was last stored in l.
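
Spelled out with comments, my reading of that one-liner is roughly:

awk '
  rand()*NR <= 1 { l = $0 }    # on line NR, keep this line when rand() <= 1/NR
  END            { print l }   # after the last line, print whatever is left in l
' $FILENAME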

The problem with that is that earlier entries are much more likely to be chosen than later ones, since rand()*3 is more likely to be less than 1 than rand()*100 is.

Whereas if you were to run rand()*(total number of lines) and then just grab the resulting line, you would be giving each line equal weight.
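
Something like this (untested, off the top of my head, and it keeps the whole file in memory) is what I'm picturing -- read the lines once, then scale rand() by the total count in the END block:

awk 'BEGIN { srand() }               # seed it, otherwise rand() repeats between runs
     { line[NR] = $0 }               # stash every line as it goes by
     END { if (NR) print line[int(rand()*NR) + 1] }   # pick 1..NR, each equally likely
' $FILENAME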

-- 
Kevan Shea