CSE 161 Assignment #7: Literary Analysis

All parts of this assignment are due via Blackboard by 10:00 am on Saturday, November 14.

Lab TAs will assign new partners for this assignment (again) at the lab on Nov 3. Please attend the lab at least long enough to learn your partner assignment. If you are unable to attend, you are responsible for contacting your Lab TA to learn your partner assignment.

Literary analysis is making increasing use of computer algorithms to determine stylistic properties of texts. For example, a scholar may be faced with the question of determining whether a newly discovered play from the time of Shakespeare was in fact authored by Shakespeare. Arguments for or against the attribution could be based on similarities or differences in the vocabulary, average sentence length, and other features of the play compared to known works by Shakespeare and his contemporaries.

In this assignment, you will create a program that compares two literary works and calculates how similar their vocabularies are. You will then use your program to make a guess at the authorship of a "newly discovered" work of literature.

Part I: Programming

Download this file containing the function TopWords. You will add code to this file in order to complete your assignment. The function does the following:

TopWords(FILENAME, N): given FILENAME (string) and N (integer), return a list of the top N most frequently occurring words in the text stored in the file.
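
For intuition only, here is a minimal sketch of a function with the behavior described above. It is not the code in the provided file, which remains the authority. The name top_words_from_text is hypothetical, it takes the text as a string rather than a filename, and its handling of case and punctuation is an assumption that may differ from TopWords.

```python
from collections import Counter

def top_words_from_text(text, n):
    """Return the n most frequent words in text (hypothetical stand-in for TopWords)."""
    # Assumption: lowercase everything and split on whitespace only.
    # The provided TopWords may clean the text differently.
    words = text.lower().split()
    counts = Counter(words)
    # most_common(n) yields (word, count) pairs, highest count first.
    return [word for word, _ in counts.most_common(n)]
```

Words that tie in frequency come back in the order they first appear in the text, so two implementations can disagree about which tied words make the cut at position N.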

You should add to the program file two more functions:

CompareWords(FILE1, FILE2, N): given the names of two files and an integer N, compute and return the number of words in the top N list of FILE1 that are also in the top N list of FILE2. Note that CompareWords does not print anything; it just returns an integer result.

main( ): Read the names of two files and an integer N from the user. Use CompareWords to calculate the number of words in common between the two files. Print out the number of words in common, and the fraction of words in common (the fraction is (number of words in common) / float(N)).
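
The counting step inside CompareWords can be pictured with a small sketch that works on two word lists that are already in hand. The helper name count_common is hypothetical, and this is one possible approach rather than a required one. It assumes neither list contains the same word twice, which should hold for a top N list.

```python
def count_common(top_a, top_b):
    """Count how many words in top_a also appear in top_b."""
    # A set makes each membership check fast, even for large N.
    in_b = set(top_b)
    return sum(1 for word in top_a if word in in_b)
```

For instance, count_common(["the", "and", "of", "to"], ["and", "to", "a", "in"]) returns 2, because only "and" and "to" appear in both lists.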

For example, a session might be:

>main()
First file name: doyle.txt
Second file name: shakespeare.txt
Top number of words: 200
Number words in common: 144
Percentage words in common: 0.72
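
The fraction in the session above is 144 / 200 = 0.72. The float(N) in the formula matters in Python 2, where dividing one integer by another discards the remainder, so 144 / 200 would give 0. A minimal sketch of the arithmetic, using the hypothetical helper name fraction_in_common:

```python
def fraction_in_common(common, n):
    """Return the fraction of the top n words that are shared."""
    # float(n) forces real division under Python 2; it is harmless in Python 3.
    return common / float(n)

print(fraction_in_common(144, 200))  # prints 0.72
```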

Test and debug your program using small text files you create yourself. You can create a text file by saving a Microsoft Word file in "plain text" format. You can also download textA.txt and textB.txt. Here are the results you should get when comparing the top 50 words in the two files:

First file name: textA.txt
Second file name: textB.txt
Top number of words: 50
Number words in common:  32
Percentage words in common:  0.64

Part II: Using the Program to Determine Authorship

Download the following files:

wsce.txt	William Shakespeare, Comedy of Errors (play)
bsam.txt	Bernard Shaw, Arms and the Man (play)
wss.txt	William Shakespeare, Sonnets (poetry)
wbp.txt	William Blake, Poems (poetry)
unknown1.txt	Unknown poetry
unknown2.txt	Unknown poetry
unknown3.txt	Unknown play
unknown4.txt	Unknown play

Use your program to provide evidence to answer the following questions:

Was unknown1.txt written by Shakespeare or Blake?
Was unknown2.txt written by Shakespeare or Blake?
Was unknown3.txt written by Shakespeare or Shaw?
Was unknown4.txt written by Shakespeare or Shaw?

Experiment with different settings for N. Do you get stronger or weaker evidence for N = 25, 50, 100, 200, 400?
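
One way to organize the experiment is to run the same comparison once for each N and collect the results in a table. The sketch below is only a suggestion. The name sweep_n is hypothetical, and the comparison function is passed in as an argument so that the sketch does not depend on how your CompareWords is written.

```python
def sweep_n(compare, file1, file2, n_values=(25, 50, 100, 200, 400)):
    """Return a list of (n, common, fraction) rows, one per value of n."""
    rows = []
    for n in n_values:
        # compare is expected to behave like CompareWords(FILE1, FILE2, N).
        common = compare(file1, file2, n)
        rows.append((n, common, common / float(n)))
    return rows
```

Calling sweep_n(CompareWords, "unknown1.txt", "wss.txt") and then the same with "wbp.txt" gives two tables whose fractions can be compared side by side for each N.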

Write up the results of your experiments and conclusions in a text or Microsoft Word file named report.txt or report.doc. Upload your report together with your modified Python code file. Include the name of your partner in the comments box.

Extra Credit (Optional): Average Sentence Length

Write a program that calculates the average length of the sentences in a file. Assume that sentences end with a period, question mark, or exclamation point. You should study the code in topwords.py to understand how to read the contents of a file into a variable as one long string, and how to use string.split to split the string into a list of words.

One strategy for solving this problem is to make sure that the end-of-sentence punctuation marks are treated as separate words by string.split. Think about how you can use the string.replace function to add white space on both sides of the punctuation marks, so that string.split creates an item for each end-of-sentence punctuation mark. Once you have a long list of "real" words and end-of-sentence marks, your program can march down the list, calculating the length of each sentence, the number of sentences, and the sum of the lengths of all of the sentences. From this information you can compute the average sentence length.

Use your program for the following tasks:

Determine whether the authorship of the unknown plays above can be established by comparing average sentence length.
Go to the web site http://www.gutenberg.org, which hosts a huge collection of free public-domain literature in the form of plain text files. Try to find an author who writes unusually long sentences on average, and another who writes unusually short sentences on average.
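
The sentence-length strategy described above (pad the punctuation, split, then walk the list) can be sketched as follows. This is one possible reading of that strategy, not a reference solution. The function name average_sentence_length is hypothetical, it takes the text as a string rather than a filename, and it uses the string methods replace and split. Two choices here are assumptions: runs of marks such as "..." or "?!" do not count as empty sentences, and any words after the last end-of-sentence mark are ignored.

```python
def average_sentence_length(text):
    """Return the average number of words per sentence in text."""
    # Surround each end-of-sentence mark with spaces so split() returns it as its own item.
    for mark in ".?!":
        text = text.replace(mark, " " + mark + " ")
    tokens = text.split()

    sentence_count = 0
    total_words = 0
    current_length = 0
    for token in tokens:
        if token in (".", "?", "!"):
            # Assumption: a mark with no words before it does not end a sentence.
            if current_length > 0:
                sentence_count += 1
                total_words += current_length
                current_length = 0
        else:
            current_length += 1

    if sentence_count == 0:
        return 0.0
    return total_words / float(sentence_count)
```

For the text "One two three. Four five? Six!" the sentences have 3, 2, and 1 words, so the function returns 2.0. Abbreviations such as "Mr." will be counted as sentence ends under the assignment's stated assumption.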

Write up your results in a text or Microsoft Word file called extracredit.txt or extracredit.doc. Save your Python program in a file called sentencelength.py. Turn these in using the Extra Credit for Assignment 7 link on Blackboard (do not include them with the rest of the assignment).