| Top 50 Words in Paradise_Lost: | ||
| - ['thir', 'thy', 'thou', 'thee', "heav'n", 'shall', 'th', 'god', 'earth', 'man', 'high', 'great', 'death', 'till', 'hath', 'hell', 'stood', 'day', 'good', 'like', 'things', 'night', 'light', 'farr', 'love', 'eve', 'o', 'world', 'adam', 'soon', 'let', 'hee', 'son', 'life', 'know', 'place', 'long', 'forth', 'self', 'mee', 'ye', 'way', 'power', 'hand', 'new', 'deep', 'end', 'fair', 'men', 'satan'] |
Hmm. I wonder how much of the difference could be due to unrecognized words, e.g., archaic spellings or contractions that the sentiment analyzer doesn't recognize and therefore scores as "neutral".
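One rough way to check this would be to count how many of the top words have no entry in the analyzer's word list. The sketch below uses a made-up `toy_lexicon` and a hypothetical helper `unrecognized_words`; neither is part of the PR or of any real sentiment library, so treat it as an illustration of the idea only.

```python
def unrecognized_words(words, lexicon):
    """Return the words that have no entry in the lexicon."""
    return [w for w in words if w not in lexicon]


# Hypothetical lexicon with invented scores, standing in for a real analyzer's word list.
toy_lexicon = {"death": -0.5, "good": 0.7, "love": 0.5}

# A few entries taken from the Paradise_Lost top-50 list above.
sample = ["thir", "thy", "heav'n", "death", "good", "love"]

missing = unrecognized_words(sample, toy_lexicon)
coverage = 1 - len(missing) / len(sample)

print(missing)
print(f"{coverage:.0%} of the sample words are in the lexicon")
```

If most of the high-frequency words turn out to be missing, that would support the idea that the archaic spellings are pulling the score toward neutral.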
text_mining.py
Outdated
| Returns: | ||
| text from url | ||
| """ | ||
| if exists(file_name) == False: |
Small style thing: rather than checking whether a boolean is equal to False, we can just write "if not exists(file_name):".
Side note, I like the structure of reading a file unless the file doesn't exist, then grabbing from the URL instead. It makes the program nice and portable!
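For reference, here is a minimal sketch of that read-from-file-or-fetch pattern. The name `get_cache(url, file_name)` comes from the PR, but the body is an assumption about how it might be written (UTF-8 text, `urllib.request` for the download), not the author's actual code.

```python
from os.path import exists
from urllib.request import urlopen


def get_cache(url, file_name):
    """Return the text in file_name, downloading it from url first if needed."""
    if not exists(file_name):
        # Cache miss: fetch the text once and save it locally.
        # Assumes the remote text is UTF-8 encoded.
        with urlopen(url) as response:
            text = response.read().decode("utf-8")
        with open(file_name, "w", encoding="utf-8") as f:
            f.write(text)
    # Cache hit (or freshly written file): read from disk.
    with open(file_name, encoding="utf-8") as f:
        return f.read()
```

Once the file exists, later runs never touch the network, which is what makes the script portable.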
| word_list[i] = word_list[i].strip(string.punctuation) | ||
| stop_words = get_stop_words('en') | ||
| stop_words_2 = ["a", "about", "above", "across", "after", "afterwards", "again", "against", "all", |
Out of curiosity, what happens if you only use the stop_words words, rather than your manually-assembled ones?
stop_words = ['a', 'about', 'above', 'after', 'again', 'against', 'all', 'am', 'an', 'and', 'any', 'are', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', "can't", 'cannot', 'could', "couldn't", 'did', "didn't", 'do', 'does', "doesn't", 'doing', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', "hadn't", 'has', "hasn't", 'have', "haven't", 'having', 'he', "he'd", "he'll", "he's", 'her', 'here', "here's", 'hers', 'herself', 'him', 'himself', 'his', 'how', "how's", 'i', "i'd", "i'll", "i'm", "i've", 'if', 'in', 'into', 'is', "isn't", 'it', "it's", 'its', 'itself', "let's", 'me', 'more', 'most', "mustn't", 'my', 'myself', 'no', 'nor', 'not', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'ought', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 'same', "shan't", 'she', "she'd", "she'll", "she's", 'should', "shouldn't", 'so', 'some', 'such', 'than', 'that', "that's", 'the', 'their', 'theirs', 'them', 'themselves', 'then', 'there', "there's", 'these', 'they', "they'd", "they'll", "they're", "they've", 'this', 'those', 'through', 'to', 'too', 'under', 'until', 'up', 'very', 'was', "wasn't", 'we', "we'd", "we'll", "we're", "we've", 'were', "weren't", 'what', "what's", 'when', "when's", 'where', "where's", 'which', 'while', 'who', "who's", 'whom', 'why', "why's", 'with', "won't", 'would', "wouldn't", 'you', "you'd", "you'll", "you're", "you've", 'your', 'yours', 'yourself', 'yourselves']
It worked pretty well, but there were a few common words that weren't included in stop_words (I can only remember 'one' being a very common word off the top of my head), so I just googled another set of stop words and copied it over.
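A possible alternative to copying a second full list is to keep the library list and add only the missing words. This is a sketch with short stand-in lists; `remove_stop_words` and `extra_stop_words` are hypothetical names, and the real `get_stop_words('en')` output is much longer than the stand-in shown here.

```python
def remove_stop_words(words, *stop_word_lists):
    """Drop every word that appears in any of the given stop-word lists."""
    stop_set = set().union(*stop_word_lists)  # set lookup is fast and removes duplicates
    return [w for w in words if w.lower() not in stop_set]


# Short stand-ins for illustration only.
library_stop_words = ["a", "the", "of", "and"]      # stands in for get_stop_words('en')
extra_stop_words = ["one", "across", "afterwards"]  # words the library list misses

words = ["one", "of", "the", "angels", "stood"]
filtered = remove_stop_words(words, library_stop_words, extra_stop_words)
print(filtered)
```

Passing only `library_stop_words` leaves 'one' in the result, which matches the behavior described above.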
text_mining.py
Outdated
| ordered_by_frequency = sorted(word_counts, key=word_counts.get, reverse=True) | ||
| return ordered_by_frequency[0:n] |
FWIW, you can shorten [0:n] to [:n]; the 0 is implied. Your version still works, though. It's just a personal preference, really.
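As a side-by-side, here is a small sketch of the sort-and-slice approach from the diff next to `collections.Counter.most_common`, which does the same job in one call. The function names and the sample list are made up for illustration.

```python
from collections import Counter


def top_n_sorted(word_list, n):
    """Same approach as the diff: sort the keys by count, then slice."""
    word_counts = Counter(word_list)
    ordered_by_frequency = sorted(word_counts, key=word_counts.get, reverse=True)
    return ordered_by_frequency[:n]  # [:n] is equivalent to [0:n]


def top_n_counter(word_list, n):
    """Alternative: let Counter do the ordering."""
    return [word for word, _ in Counter(word_list).most_common(n)]


sample = ["god", "man", "god", "earth", "god", "man"]
print(top_n_sorted(sample, 2))
print(top_n_counter(sample, 2))
```

Both return the same words here; which one to use is mostly a matter of taste.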
| # print the sentiment of the top n words | ||
| print('Sentiment of Top %d Words in %s:' % (n, title)) | ||
| print(sentiment_analyzer(top_n_words), '\n') | ||
| # print the sentiment of the whole text |
This might be a little too much commenting - print statements like these mostly stand on their own.
Though, a bit too much documentation is better than a bit too little!
text_mining_tfidf.py
Outdated
| from textblob import TextBlob as tb # pip install textblob | ||
| def get_cache(url, file_name): |
One thing to look into: you can import your own functions in Python (e.g., "from text_mining import get_cache"). It'd save you some repeated code!
Revised. Fixed syntax. Imported functions from text_mining instead of repeating them in text_mining_tfidf. Removed the redundant stop_words removal from text_mining_tfidf; this changes the TF-IDF score of each word, because stop words are now weighed into the score.
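To see why keeping stop words changes the scores, here is a sketch using a common textbook form of TF-IDF (term count over document length, times the log of a smoothed inverse document frequency). This formula is an assumption and may differ from what text_mining_tfidf.py actually computes; the sample documents are made up.

```python
import math


def tf(word, doc):
    """Term frequency: share of the document's words that are `word`."""
    return doc.count(word) / len(doc)


def idf(word, docs):
    """Inverse document frequency with +1 smoothing in the denominator."""
    n_containing = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / (1 + n_containing))


def tfidf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)


# Each document is a list of words; the second set has stop words stripped.
with_stops = [["the", "god", "of", "the", "earth"],
              ["the", "night", "and", "the", "day"],
              ["the", "man", "of", "the", "world"]]
without_stops = [["god", "earth"], ["night", "day"], ["man", "world"]]

score_with = tfidf("god", with_stops[0], with_stops)
score_without = tfidf("god", without_stops[0], without_stops)
print(score_with, score_without)
```

Under this formula the stop words enlarge the document length, so every content word gets a smaller term frequency when they are left in.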