• +91 9898989898
  • info@assignmentonlinehelp.com
Blog

Build A Data Structure To Efficiently Calculate Word Occurrence And Co-Location Statistics

In this assignment, you will write a program that reads a text document as input and builds a data structure to efficiently calculate word occurrence and co-location statistics. These statistics form the basis for many for a wide variety of algorithms in areas ranging from spelling correction to natural language processing.
As before, you should develop your code with JUnit tests. For the demonstration of your classes and methods, you should go online and find an electronic text that interests you (Project Gutenberg is a good place to start). Use this text for your demonstration. Feel free to use more than one example text. Be sure to also illustrate the extra credit functions if you have done them.

The general steps are:
• read the text file, extract and normalize the words
• compute word and word pair statistics
• implement functionality for a variety of queries:
– word counts and rank of specific words or word pairs
– k most (or least) common words or pairs
– conditional word count given a preceding or following word (collocation)
Notes: Your implementation should also account for word pairs on separate lines, i.e. the last word of linenand the first word of line n+ 1 is also a valid word pair. You do not have to handle hyphenation. Most texts from Project Gutenberg do not have hyphenated words that span different lines, so this not an issue. However, hyphenated words like “grown-up” will be counted as the single word “grownup” by virtue of the word normalization, one step of which is removing the punctuation. In the specifications below, all the methods should be public unless otherwise noted.

Tokenizer
Write a class Tokenizer that provides the functionality to read lines of a text file, extract and normalize the words, and store them using Java’s ArrayList class composed of String elements. You can use Java’s BufferedReader or Scanner classes to read text files and the ArrayList class to store the word list. Refer to Java’s regular expression documentation for the whitespace and punctuation character classes.
• A constructor with a String argument specifies the file from which to obtain the words. • A constructor with a String[] obtains the words directly from the String array. No file is read. • ArrayList wordList() returns the word list created from the constructors.
• Split lines from the text file into words using whitespace characters. You can use Java’s split method in the String class.
• Normalize words for both constructors as follows:
–               Convert words to lower case.
–               Remove leading and trailing whitespace. – Remove punctuation.
 
SOLUTION:
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Scanner;
 
public class Tokenizer {
    private final String filename;
    private final String[] words;
 
    public Tokenizer(String filename) {
        this.filename = filename;
        this.words = null;
    }
 
    public Tokenizer(String[] words) {
        this.filename = null;
        this.words = words;
    }
 
    private void addNormalizedWordsFromArray(ArrayList normalized, String[] words) {
        for (String word : words) {
            normalized.add(word.toLowerCase());
        }
    }
 
    public ArrayList wordList() {
        if (filename != null) {
            try(Scanner scanner = new Scanner(new File(filename))) {
                ArrayList result = new ArrayList<>();
                while(scanner.hasNextLine()) {
                    String line = scanner.nextLine();
                    String[] words = line.split("[^a-zA-Z]");
                    addNormalizedWordsFromArray(result, words);
                }
                return result;
            }
            catch (IOException e) {
                return null;
            }
        }
        else {
            ArrayList result = new ArrayList<>();
            addNormalizedWordsFromArray(result, words);
            return result;
        }
    }
}
 
HashEntry
Define a class HashEntry which you will use to define HashTable class below.
• The constructor HashEntry(String key, int value) should initialize a HashEntry with key and value.
1
• Provide accessor methods String getKey() and int getValue() for the key and value fields.
• void setValue(int value) should set or update the value field.
 
SOLUTION:
public class HashEntry {
    private final String key;
    private int value;
 
    public HashEntry(String key, int value) {
        this.key = key;
        this.value = value;
    }
 
    public String getKey() {
        return key;
    }
 
    public int getValue() {
        return value;
    }
 
    public void setValue(int value) {
        this.value = value;
    }
}
 
HashTable
Write a class HashTable for String keys. You will use Java’s built-in hashCode() function to compute hash codes for strings, but you should not use Java’s built-in HashTable class to implement this class. The hashCode function can return negative numbers, so you also take the absolute value before modding it by the table size. The hash table entries should be defined using the HashEntry class above.
•  A HashTable constructor with no arguments should initialize a table of a reasonable default size. • A HashTable constructor with an integer argument initializes the table to the specified size. • void put(String key, int value) should store the key-value pair in the hash table. In addition:
– It should handle collisions. You may use either separate chaining or a probing strategy. – It should resize the table when necessary.
•  void put(String key, int value, int hashCode) (note the different signature) should have the same functionality as put above, but use the provided hash code, which you should assume comes from a different hashCode function other than Java’s. This is to allow for direct testing of collisions.
•  void update(String key, int value) should update value associated with key in the hash table. If key does not exist, it should be added to the table.
•  int get(String key) returns current the value associated with key if key exists and -1 if it does not.
•  int get(String key, int hashCode) should have the same functionality as get above, but use the provided hash code (which should then be modded by the table size). This is to allow for direct testing of collisions.

SOLUTION:
 import java.util.ArrayList;
import java.util.List;
 
public class HashTable {
    private static final int DEFAULT_SIZE = 25;
    private static final double RESIZE_THRESHOLD = 0.8;
 
    private List> table;
    private int size;
    private int buckets;
 
    public HashTable(int buckets) {
        this.buckets = buckets;
        table = new ArrayList<>();
        for (int i = 0; i < buckets; i++) {
            table.add(new ArrayList<>());
        }
        this.size = 0;
    }
 
    public HashTable() {
        this(DEFAULT_SIZE);
    }
 
    public void put(String key, int value) {
        put(key, value, key.hashCode());
    }
 
    public void put(String key, int value, int hashCode) {
        table.get(getBucket(hashCode)).add(new HashEntry(key, value));
        size++;
        if (1.0 * size / buckets > RESIZE_THRESHOLD) {
            resize();
        }
    }
 
    public void update(String key, int value) {
        int bucket = getBucket(key.hashCode());
        List list = table.get(bucket);
        for (HashEntry entry : list) {
            if (entry.getKey().equals(key)) {
                entry.setValue(value);
                return;
            }
        }
        put(key, value);
    }
 
    public int get(String key) {
        return get(key, key.hashCode());
    }
 
    public int get(String key, int hashCode) {
        int bucket = getBucket(hashCode);
        List list = table.get(bucket);
        for (HashEntry entry : list) {
            if (entry.getKey().equals(key)) {
                return entry.getValue();
            }
        }
        return -1;
    }
 
    private int getBucket(int hash) {
        if (hash < 0) {
            hash = -hash;
        }
        return hash % buckets;
    }
 
    private void resize() {
        List> oldTable = table;
 
        table = new ArrayList<>();
        buckets *= 2;
        for (int i = 0; i             table.add(new ArrayList<>());
        }
 
        for (List list : oldTable) {
            for (HashEntry hashEntry : list) {
                put(hashEntry.getKey(), hashEntry.getValue());
            }
        }
    }
}
 
WordStat
Here you will write a WordStat class to compute various words statistics. You should design the class and use your classes above so that the statistics are computed once upon construction for a given source, rather than for each query.
• a constructor with a String argument computes the statistics from the file name. • a constructor with a String array argument computes the statistics from the words in the String array. • the methods below that take word arguments should be normalized before computing the result.
• int wordCount(String word) returns the count of the word argument. Return zero if the word is not in the table.
• int wordPairCount(String w1, String w2) returns the count of the word pair w1 w2. Return zero if the word pair is not in the table.
• int wordRank(String word) returns the rank of word, where rank 1 is the most common word.
• int wordPairRank(String w1, String w2) returns the rank of the word pair w1 w2 (which should also be normalized). Return zero if the word pair is not in the table.
• String [] mostCommonWords(int k) returns a String array of the k most common words in decreasing order of their count.
• String [] leastCommonWords(int k) returns a String array of the k least common words in increasing order of their count.
• String [] mostCommonWordPairs(int k) returns the k most common the word pair in a String array, where each element is in the form "word1 word2", i.e. separated by a single space.
2
• String [] mostCommonCollocs(int k, String baseWord, int i) returns the k most common words at a given relative position i to baseWord. These are called “collocations.” The relative position can be either
+1 or -1 to indicate words following or preceding the base word. For example, mostCommonCollocs(10,"crash",-1)
would return the 10 most common words thatprecede“crash” in the source text. You do not need to
implement relative positions other than +1 or -1.

SOLUTION:
 import java.util.ArrayList;
import java.util.Comparator;
 
public class WordStat {
    private final ArrayList wordlist;
    private ArrayList pairlist;
 
    private HashTable hashTable;
    private HashTable hashTable2;
 
    public WordStat(String filename) {
        Tokenizer tokenizer = new Tokenizer(filename);
        wordlist = tokenizer.wordList();
    }
 
    public WordStat(String[] words) {
        Tokenizer tokenizer = new Tokenizer(words);
        wordlist = tokenizer.wordList();
    }
 
    private void buildStat() {
        hashTable = new HashTable();
        hashTable2 = new HashTable();
 
        for (String word : wordlist) {
            int count = hashTable.get(word);
            if (count == 0) {
                hashTable.put(word, 1);
            }
            else {
                hashTable.update(word, count+1);
            }
        }
 
        pairlist = new ArrayList<>();
        for (int i = 0; i             String pair = wordlist.get(i) + " " + wordlist.get(i+1);
            pairlist.add(pair);
            int count = hashTable2.get(pair);
            if (count == 0) {
                hashTable2.put(pair, 1);
            }
            else {
                hashTable2.update(pair, count+1);
            }
        }
 
        wordlist.sort((o1, o2) -> Integer.compare(hashTable.get(o2), hashTable.get(o1)));
        pairlist.sort((o1, o2) -> Integer.compare(hashTable2.get(o2), hashTable2.get(o1)));
    }
 
    public int wordCount(String word) {
        return Math.max(0, hashTable.get(word));
    }
 
    public int wordPair(String w1, String w2) {
        return Math.max(0, hashTable.get(w1 + " " + w2));
    }
 
    public int wordRank(String word) {
        return wordlist.indexOf(word) + 1;
    }
 
    public int wordPairRank(String w1, String w2) {
        return wordlist.indexOf(w1 + " " + w2) + 1;
    }
 
    public String[] mostCommonWords(int k) {
        return wordlist.subList(0, Math.min(wordlist.size(), k)).toArray(new String[0]);
    }
 
    public String[] leastCommonWords(int k) {
        return wordlist.subList(Math.max(0, wordlist.size()-k), wordlist.size()).toArray(new String[0]);
    }
 
    public String[] mostCommonWordPairs(int k) {
        return pairlist.subList(0, Math.min(pairlist.size(), k)).toArray(new String[0]);
    }
 
    public String[] mostCommonCollocs(int k, String baseWord, int i) {
        ArrayList result = new ArrayList<>();
        for (String pair : pairlist) {
            String[] words = pair.split("\\s+");
            if (i == -1 && words[1].equals(baseWord)) {
                result.add(words[0]);
            }
            if (i == 1 && words[0].equals(baseWord)) {
                result.add(words[1]);
            }
        }
        return result.subList(0, Math.min(result.size(), k)).toArray(new String[0]);
    }
}
 
3

Related Posts