Pull random line from a file

Conrad Newton conrad.newton at broadpark.no
Fri Sep 6 12:23:38 PDT 2002


On Fri, Sep 06, 2002 at 06:05:55PM +0100, Richard Lightman wrote:
> * Kevan Shea <biffcool at zombieworld.com> [2002-09-06 17:38]:
> > On Fri, 6 Sep 2002 17:05:40 +0100, Richard Lightman <richard at nezumi.plus.com> wrote:
> > > 
> > > awk 'BEGIN {srand()} rand()*NR<=1 {l=$0} END {print l}' "$FILENAME"
> > > 
> > > Richard
> > 
> > Okay I don't know awk at all.  So this is all from a skimming of man gawk, and an uncanny ability to completely misunderstand things.
> > 
> > rand() is a number between 0 and 1.  NR is the number of lines read so far.  So rand()*NR is a number between 0 and the number of the line you're currently reading, and you're checking to see if it's less than or equal to 1.  If it is, you're storing the current line as l.  Then after you've run that on every line, you print the last thing stored in l.
> > 
> Correct.
> 
> > The problem with that is that earlier entries are much more likely to be chosen than later ones, as rand()*3 is more likely to be less than 1 than rand()*100 is.
> > 
> Wrong
> 
> Say there are five lines:
> 
> Line | Probability for getting that line.
> 1    | 1/1 * 1/2 * 2/3 * 3/4 * 4/5 = 1/5
> 2    |       1/2 * 2/3 * 3/4 * 4/5 = 1/5
> 3    |             1/3 * 3/4 * 4/5 = 1/5
> 4    |                   1/4 * 4/5 = 1/5
> 5    |                         1/5 = 1/5
> 
> It is a trick I first saw in the source for nethack.

i.e. the probability that rand()*NR is at most one on line n, times
     the probability that it is greater than one on every line after n
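
Spelled out for a file of N lines, and treating rand() as uniform on [0,1) so that line k overwrites l with probability 1/k (N, n and k are my own symbols, not from the thread), the five-line table generalizes roughly like this:

```latex
P(\text{line } n \text{ is printed})
  = \frac{1}{n} \prod_{k=n+1}^{N} \left(1 - \frac{1}{k}\right)
  = \frac{1}{n} \prod_{k=n+1}^{N} \frac{k-1}{k}
  = \frac{1}{n} \cdot \frac{n}{N}
  = \frac{1}{N}
```

The product telescopes, which is why every line ends up with the same 1/N chance however long the file is.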

Very cool.  awk wins over sed.  Long Live awk!
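
For anyone who would rather see it than trust the fractions, here is a rough sketch that replays the same test many times inside one awk process and prints how often each line wins. simulate_picks and its two arguments are names I made up for this, not anything from the thread. The srand() call is there because gawk otherwise produces the same rand() sequence on every run.

```shell
# simulate_picks LINES TRIALS  (hypothetical helper, not from the thread)
# Replays the selection rule of the one-liner TRIALS times on an imaginary
# file of LINES lines and prints the fraction of runs each line won.
simulate_picks() {
    awk -v lines="$1" -v trials="$2" 'BEGIN {
        srand()    # seed from the clock
        for (t = 0; t < trials; t++) {
            pick = 0
            for (n = 1; n <= lines; n++) {
                # same test as rand()*NR<=1, with n standing in for NR
                if (rand() * n <= 1)
                    pick = n    # line n replaces the current pick
            }
            count[pick]++
        }
        for (n = 1; n <= lines; n++)
            printf "line %d: %.3f\n", n, count[n] / trials
    }'
}

simulate_picks 5 100000
```

With five lines, each fraction should come out close to 0.2, early lines and late lines alike.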

Conrad

-- 
Unsubscribe: send email to listar at linuxfromscratch.org
and put 'unsubscribe lfs-chat' in the subject header of the message


