Pull random line from a file

Richard Lightman richard at nezumi.plus.com
Fri Sep 6 10:05:55 PDT 2002


* Kevan Shea <biffcool at zombieworld.com> [2002-09-06 17:38]:
> On Fri, 6 Sep 2002 17:05:40 +0100, Richard Lightman <richard at nezumi.plus.com> wrote:
> > 
> > awk 'rand()*NR<=1 {l=$0} END {print l}' "$FILENAME"
> > 
> > Richard
> 
> Okay I don't know awk at all.  So this is all from a skimming of man gawk, and an uncanny ability to completely misunderstand things.
> 
> rand() is a number between 0 and 1.  NR is the number of lines read so far.  So you have a number between 0 and the number of the line you're currently reading, and you're checking to see if it's less than or equal to 1.  If it is, you're storing the current line as l.  Then, after you've run that on every line, you print the last thing stored in l.
> 
Correct.
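
Spelled out with comments, the one-liner might look like the sketch
below. The function name random_line is made up for illustration, and
seeding from $RANDOM assumes bash (or another shell that provides it).

```shell
# random_line FILE: print one line of FILE chosen uniformly at random.
# (random_line is a hypothetical helper name, not part of the original post.)
random_line() {
    awk -v seed="$RANDOM" '
        BEGIN { srand(seed) }          # seed from the shell, not the clock
        rand() * NR <= 1 { l = $0 }    # replace the kept line with probability 1/NR
        END { print l }                # the line still held after the last read
    ' "$1"
}
```

Line 1 is always stored, because rand()*1 is never greater than 1, so
the function prints something for any non-empty file.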

> The problem with that is that earlier entries are much more likely to be chosen than later ones, as rand()*3 is more likely to be less than 1 than rand()*100 is.
> 
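Wrong.

Say there are five lines. Line k replaces the stored line with
probability 1/k when it is read, and then has to survive each later
line j, which leaves it alone with probability (j-1)/j. The factors
cancel:

Line | Probability of getting that line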
1    | 1/1 * 1/2 * 2/3 * 3/4 * 4/5 = 1/5
2    |       1/2 * 2/3 * 3/4 * 4/5 = 1/5
3    |             1/3 * 3/4 * 4/5 = 1/5
4    |                   1/4 * 4/5 = 1/5
5    |                         1/5 = 1/5
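
To check the same arithmetic for other file lengths, something like
this should do. It is only a sketch; line_probs is a name I made up,
and it multiplies out the factors from the table rather than reading
any file.

```shell
# line_probs N: for each line k of an N-line file, print k and the
# probability that the one-liner ends up printing line k.
# (line_probs is a hypothetical helper name.)
line_probs() {
    awk -v n="$1" 'BEGIN {
        for (k = 1; k <= n; k++) {
            p = 1 / k                    # line k is stored when it is read
            for (j = k + 1; j <= n; j++)
                p *= (j - 1) / j         # and is not replaced by line j
            printf "%d %.4f\n", k, p
        }
    }'
}
```

Calling line_probs 5 prints 0.2000 next to each of the five line
numbers, matching the table.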

It is a trick I first saw in the source for nethack.

Here is a quick test:

seq 1 100 >100
for((i=0; i<10000; i++)); do awk 'BEGIN {srand('$RANDOM')} rand()*NR<=1 {l=$0} END {print l}' 100; done >test
awk '{++a[$0]} END {for(i in a) print i,a[i]}' test

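Normally, mawk seeds the random number generator from the time, so
every run started within the same second would pick the same line,
which would limit us to one independent sample per second.
BEGIN {srand('$RANDOM')} fixes that: bash's random number sequence is
not restarted on each iteration, so each run gets a different seed.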
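
Another way around the seeding problem, as a rough sketch, is to run
all the trials inside one awk process and seed only once. This
simulates the selection rule over line numbers 1..N instead of reading
a file; the function name simulate and its arguments are my own
invention.

```shell
# simulate N TRIALS SEED: apply the selection rule TRIALS times to line
# numbers 1..N in a single awk process and print "line count" pairs.
# (simulate is a hypothetical helper name.)
simulate() {
    awk -v n="$1" -v trials="$2" -v seed="$3" 'BEGIN {
        srand(seed)                          # seed once for all trials
        for (t = 0; t < trials; t++) {
            for (nr = 1; nr <= n; nr++)
                if (rand() * nr <= 1)        # same rule as the one-liner
                    l = nr
            ++count[l]                       # record which line won this trial
        }
        for (i = 1; i <= n; i++)
            print i, count[i] + 0            # + 0 so an unseen line prints 0
    }'
}
```

For example, simulate 100 10000 "$RANDOM" should give counts scattered
around 100 for every line, with no drift towards the early ones.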

Richard



More information about the lfs-chat mailing list