The data used for the segmentation experiments reported in Read et al. (2012)
was created from the open-source Conan Doyle Corpus (CDC), available from
http://www.delph-in.net/cdc/.

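To create the unsegmented.txt file used in the segmentation experiments
(line breaks within a paragraph are replaced by spaces, and blank lines are
kept as paragraph breaks):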

cat {baskerville,cardboard,circle,wisteria}*.txt |perl -pe 's/ +/ /g;'|\
	perl -pe 's/\n/ /;'| perl -pe 's/  +/\n\n/g' > unsegmented.txt

To create the segmented.txt file used for evaluation (blank lines are removed):

cat {baskerville,cardboard,circle,wisteria}*.txt |grep -v "^$" > segmented.txt
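
Both commands change only whitespace, so the two files should contain the same
text once all whitespace is removed. The following is a sketch of an optional
sanity check, not part of the original procedure; the helper name same_text is
hypothetical.

```shell
# Hypothetical helper: succeeds when two files are identical after all
# whitespace (spaces, tabs, newlines) has been removed.
same_text() {
	a=$(tr -d '[:space:]' < "$1" | cksum)
	b=$(tr -d '[:space:]' < "$2" | cksum)
	[ "$a" = "$b" ]
}

# Usage, once both files exist:
#   same_text unsegmented.txt segmented.txt && echo "texts match"
```

A mismatch would suggest that the two files were built from different sets of
input files.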