miriam english (miriam_e) wrote,

Radiolab talks and analysing my stories

Been a while since I blogged here. I should do so more often.

It has been my habit for years to eat my meals while listening to a talk, or watching a documentary, or watching a piece of fiction. Lately I've been listening to one of my favorite shows, Radiolab, after having downloaded a heap more of their shows. It really is an amazing show.

Yesterday I listened to an episode called "Oops". It is an hour long episode that originally aired on 28th June 2010. If you want to download it, the direct link is:
A lot of that episode was very funny, where they talked about the kind of silly errors that result from injudicious use of spell-checker programs, but one of the longer stories was extremely serious: it was about how torture created an awful terrorist. I wish they'd followed the implications through more completely, and I was surprised that they just left it hanging there before moving on with the rest of the episode.

This morning I ate breakfast while listening to an episode from 26th July 2010 titled "Secrets of Success" in which Robert chatted with Malcolm Gladwell (one of my favorite thinkers) about what makes success. It was funny and very informative. I love the conclusion that, more than anything, doing something obsessively, basically for the love of it, is what makes someone so good at it that it often gets referred to somewhat mystically as "genius". It gives me hope that my writing might have some value, despite my vanishingly small audience.

Further to that last point, a few days ago I was listening to another Radiolab episode "Vanishing Words", from 5th of May, 2010.
The episode was about dementia, something that concerns me greatly, as it appears to run in my family. It is one of my greatest fears. The talk was largely about work that has been done using words as a window into the effect dementia has on the brain.

I couldn't stop thinking about it afterward, and ended up creating a fairly simple program that analysed each of my 6 novels, working out how many unique words each one contained, then attempted to estimate what kind of vocabulary that represented by dividing the number of unique words by the total number of words. I'm not entirely sure this is the best or most reliable way to do this, but it might give a rough guide. I was surprised, and somewhat relieved, to find that my books have been trending towards greater vocabularies. My story "flying" is a bit of an exception, having a very low vocabulary, but I think that may be because it consists almost entirely of dialogue and the main character is a fairly naïve young girl.

I love the fact that it's so damn easy to do that kind of thing in Linux. Unlike Microsoft Windows and Apple Mac computers, which actively discourage people from writing programs, Linux makes available dozens of easy tools for programming.

For my simple concordance program I used mostly sed, a very simple and fast stream editor that lets me feed text into a bunch of commands so that what comes out the other end is modified according to those commands. I also used Linux's tr, wc, sort, uniq, and bc commands. These are part of every Linux distribution.

I used sed mainly to get rid of any HTML tags I'd embedded in the text, and also to remove blank lines. The tr command let me translate certain characters to other characters (uppercase to lowercase, so words that started sentences were not counted separately, and spaces to end-of-line characters, to put each word on its own line) and explicitly delete certain characters (mostly punctuation and numbers). The wc command counts characters, words, and lines in a text file. I sorted the file two ways using the sort command. First, after each word had been put on a separate line, I sorted them alphabetically so I could then run uniq on the list, which collapsed it, getting rid of duplicates and prefixing each word with its number of occurrences. Then I sorted again, but this time numerically, from lowest count (most unusual words) to highest (most common). Finally I used bc, the command-line calculator, to find the ratio of unique words to total words as a single floating-point number. Really pretty simple.
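The whole chain can be sketched on a toy sentence (the same idea as my script further down, minus the HTML stripping):

```shell
# Toy version of the counting pipeline: strip punctuation, put one word
# per line, lowercase everything, then count duplicates and sort by
# frequency. "the" and "cat" each appear twice and sort to the bottom.
echo 'The cat sat. The cat ran.' \
    | tr -d '.,?":();!0-9' \
    | tr ' ' '\n' \
    | sed '/^$/d' \
    | tr '[:upper:]' '[:lower:]' \
    | sort | uniq -c | sort -n
```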

Results were:
Vocabulary (unique/total)
Shirlocke: .14331
companions: .13691
selena: .13151
prescription: .12292
insurance: .11315
flying: .09655

Another way of measuring the text is to analyse sentence complexity. There is already a Linux command that can do that. It is called style, though I'm not sure the output is very useful for what I want. The manual does give various formulas for calculating sentence complexity, so that's useful. I may look at doing that another day.
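Until I get around to the proper formulas, one very crude proxy for sentence complexity is average words per sentence. This is my own quick sketch, not what style outputs, and it naively treats every . ! or ? as a sentence end:

```shell
# Rough stand-in for sentence complexity: average words per sentence.
# Sentence boundaries are naively taken to be . ! or ? characters,
# good enough for a rough number, not for real analysis.
printf '%s\n' 'This is short. This sentence is quite a bit longer than that one.' \
    | tr '.!?' '\n\n\n' \
    | awk 'NF { sentences++; words += NF }
           END { printf "%.2f\n", words / sentences }'
```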

For anybody who is interested, here is my quick and simple concordance program. The parts in red (the lines beginning with #) are comments. They're just there to help me understand what the heck I was doing when I read it again six months later.

(I've put the code behind a cut tag because LJ messes up the entire journal if I have long lines.)


#!/bin/bash
# concordance
# by Miriam
# Saturday 2016-05-07 10:08:55 am
# After listening to the Radiolab episode
# "Radiolab 2010-05-05 - Vanishing Words"
# which talked about analysing texts for early signs of dementia
# I wondered what analysing my texts might reveal.
# I made a quick search for ideas on the way to do this
# and the most helpful site that I found was http://dsl.org/cookbook/cookbook_16.html
# which discussed existing low-level Linux commands that could do the job.
# Stripped out all HTML - snaffled my own code from my "wordcount" script for that.
# Also removed punctuation (but not dash or apostrophe) and numbers.
# Ensured text is Unix format, not MS format (filtered out carriage-return characters).
# I couldn't be bothered with old Apple format -- they've changed to Unix format now anyway.
# Got rid of blank lines.
# Translated all to lowercase so words at sentence starts don't get counted separately.

# test to see if started from CLI or icon
tstcli=`tty | head -c3`
if [ "$tstcli" = "not" ]; then
	xmessage "EEEK!! Don't click here!
    Run from CLI."
	exit 1
fi

function show_options {
	echo -e "usage: ${0##*/} <text_or_html_file>"
	echo -e "  Analyses text for vocabulary and word frequency."
	echo -e "  "
}

if [ "$1" = "" -o "$1" = "-h" ]; then
	show_options
	exit 1
fi

pname="${1%/*}"      ; # /mnt/drive/dir
fname="${1##*/}"     ; # file.tar.gz
bname="${fname%%.*}" ; # file
b2name="${fname%.*}" ; # file.tar
ename="${fname##*.}" ; # gz
e2name="${fname#*.}" ; # tar.gz

echo -e "Analysis of $1
" >"${b2name}_concordance.txt"
echo -e -n "Number of unique words (vocabulary): " >>"${b2name}_concordance.txt"

# remove HTML tags,
# delete punctuation and numbers,
# convert from MSWin format to Unix format by deleting all carriage returns,
# translate spaces to newlines,
# delete tabs and blank lines,
# translate everything to lowercase,
# store in temporary file
cat "$1" | sed ':a; s/<[^>]*>//g;/</N;//ba' \
	| tr -d '.,?":();!0-9' \
	| tr -d '\r' \
	| tr ' ' '\n' \
	| tr -d '\t' \
	| sed '/^$/d' \
	| tr '[:upper:]' '[:lower:]' >/tmp/concord_temp

# calculate vocabulary
numberofwords=`wc -l /tmp/concord_temp | cut -d' ' -f1`
uniquewords=`cat /tmp/concord_temp | sort | uniq -c | wc -l`
vocab=`echo "scale=5; ${uniquewords}/${numberofwords}" | bc` # scale=5 gives five decimal places without relying on my bc config
echo -e "$uniquewords
Total number of words: $numberofwords
Vocabulary (unique/total): $vocab" >>"${b2name}_concordance.txt"

# create list: numbers of words
echo -e "
Numbers of individual words:" >>"${b2name}_concordance.txt"
cat /tmp/concord_temp | sort | uniq -c | sort -n >>"${b2name}_concordance.txt"

# if rox doesn't exist, print a message
# otherwise use rox to use the default text viewer to display result
if [ "`which rox`" = "" ]; then
	echo "Analysis of $fname is in ${b2name}_concordance.txt"
else
	rox "${b2name}_concordance.txt"
fi

(Crossposted from http://miriam-e.dreamwidth.org/330209.html at my Dreamwidth account.)