Complete text mining by vivienyuwenchen · Pull Request #1 · sd17fall/TextMining

vivienyuwenchen · 2017-10-09T05:35:27Z

Revised. Fixed syntax. Imported functions from text_mining instead of repeating them in text_mining_tfidf. Removed the redundant stop_words removal from text_mining_tfidf; this changes the TF-IDF score of each word, because stop words are now weighed into the score.
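To illustrate why the stop-word step affects the scores, here is a minimal sketch of the term-frequency half of TF-IDF. It assumes the usual definition (occurrences of a word divided by the document's total token count); the `STOP_WORDS` set, the sample sentence, and the `term_frequency` helper are made up for illustration and are not taken from text_mining_tfidf.

```python
# Tiny illustrative stop-word set (an assumption, not the PR's actual list).
STOP_WORDS = {"the", "of", "and"}


def term_frequency(word, words):
    """Share of the tokens in `words` that are equal to `word`."""
    return words.count(word) / len(words)


doc = "the fall of man and the loss of eden".split()
filtered = [w for w in doc if w not in STOP_WORDS]

# With stop words kept, they count toward the denominator (9 tokens).
tf_with_stop_words = term_frequency("eden", doc)
# With stop words removed, the denominator shrinks (4 tokens).
tf_without_stop_words = term_frequency("eden", filtered)

print(tf_with_stop_words, tf_without_stop_words)
```

The same word gets a lower term frequency when stop words stay in the document, so any score built on it shifts as well.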

matthewruehle · 2017-10-16T02:37:28Z

Project Writeup and Reflection.md

+
+Top 50 Words in Paradise_Lost:
+
+- ['thir', 'thy', 'thou', 'thee', "heav'n", 'shall', 'th', 'god', 'earth', 'man', 'high', 'great', 'death', 'till', 'hath', 'hell', 'stood', 'day', 'good', 'like', 'things', 'night', 'light', 'farr', 'love', 'eve', 'o', 'world', 'adam', 'soon', 'let', 'hee', 'son', 'life', 'know', 'place', 'long', 'forth', 'self', 'mee', 'ye', 'way', 'power', 'hand', 'new', 'deep', 'end', 'fair', 'men', 'satan']


Hmm. I wonder how much of the difference could be attributed to unrecognized words, e.g., archaic spellings or contractions that the sentiment analyzer doesn't recognize and thus scores as "neutral".
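One way to probe this would be to map archaic spellings to modern ones before running the sentiment analysis and compare the results. A rough sketch, assuming a hand-built mapping: the `ARCHAIC_TO_MODERN` dictionary and `normalize_spelling` helper are hypothetical, and the entries are only a few spellings picked from the top-50 list above.

```python
# Hypothetical mapping from Milton's spellings to modern ones.
ARCHAIC_TO_MODERN = {
    "thir": "their",
    "hee": "he",
    "mee": "me",
    "farr": "far",
    "heav'n": "heaven",
}


def normalize_spelling(words):
    """Replace known archaic spellings; leave every other word unchanged."""
    return [ARCHAIC_TO_MODERN.get(word, word) for word in words]


sample = ["thir", "heav'n", "god", "farr", "love"]
normalized = normalize_spelling(sample)
print(normalized)
```

Whether this changes the sentiment scores depends on the analyzer's lexicon, so it is an experiment to try rather than a guaranteed fix.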

matthewruehle · 2017-10-16T02:41:34Z

text_mining.py

+    Returns:
+        text from url
+    """
+    if exists(file_name) == False:


Small style thing - rather than checking if a boolean is equal to False, we can just do "if not exists(file_name):".

Side note, I like the structure of reading a file unless the file doesn't exist, then grabbing from the URL instead. It makes the program nice and portable!
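For reference, the read-or-download structure with the "not" form could look roughly like the sketch below. The real get_cache in the PR takes only (url, file_name); the extra `fetch` parameter and the `fake_fetch` stand-in are assumptions added here so the example runs without network access.

```python
import os
import tempfile
from os.path import exists


def get_cache(url, file_name, fetch):
    """Return the text stored in file_name, calling fetch(url) only if the file is missing."""
    if not exists(file_name):
        text = fetch(url)
        with open(file_name, "w", encoding="utf-8") as f:
            f.write(text)
    with open(file_name, encoding="utf-8") as f:
        return f.read()


calls = []


def fake_fetch(url):
    """Stand-in for a real download; records each call so we can count them."""
    calls.append(url)
    return "Of Man's first disobedience"


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "paradise_lost.txt")
    first = get_cache("http://example.invalid/book", path, fake_fetch)
    second = get_cache("http://example.invalid/book", path, fake_fetch)

print(first == second, len(calls))
```

The second call reads the cached file, so the fetcher is only invoked once.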

matthewruehle · 2017-10-16T02:43:39Z

text_mining.py

+        word_list[i] = word_list[i].strip(string.punctuation)
+
+    stop_words = get_stop_words('en')
+    stop_words_2 = ["a", "about", "above", "across", "after", "afterwards", "again", "against", "all",


Out of curiosity, what happens if you only use the stop_words words, rather than your manually-assembled ones?

stop_words = ['a', 'about', 'above', 'after', 'again', 'against', 'all', 'am', 'an', 'and', 'any', 'are', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', "can't", 'cannot', 'could', "couldn't", 'did', "didn't", 'do', 'does', "doesn't", 'doing', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', "hadn't", 'has', "hasn't", 'have', "haven't", 'having', 'he', "he'd", "he'll", "he's", 'her', 'here', "here's", 'hers', 'herself', 'him', 'himself', 'his', 'how', "how's", 'i', "i'd", "i'll", "i'm", "i've", 'if', 'in', 'into', 'is', "isn't", 'it', "it's", 'its', 'itself', "let's", 'me', 'more', 'most', "mustn't", 'my', 'myself', 'no', 'nor', 'not', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'ought', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 'same', "shan't", 'she', "she'd", "she'll", "she's", 'should', "shouldn't", 'so', 'some', 'such', 'than', 'that', "that's", 'the', 'their', 'theirs', 'them', 'themselves', 'then', 'there', "there's", 'these', 'they', "they'd", "they'll", "they're", "they've", 'this', 'those', 'through', 'to', 'too', 'under', 'until', 'up', 'very', 'was', "wasn't", 'we', "we'd", "we'll", "we're", "we've", 'were', "weren't", 'what', "what's", 'when', "when's", 'where', "where's", 'which', 'while', 'who', "who's", 'whom', 'why', "why's", 'with', "won't", 'would', "wouldn't", 'you', "you'd", "you'll", "you're", "you've", 'your', 'yours', 'yourself', 'yourselves']

It worked pretty well, but there were a few common words that weren't included in stop_words (I can only remember 'one' being a very common word off the top of my head), so I just googled another set of stop words and copied it over.
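If both lists stay in use, merging them into one set removes duplicates and keeps lookups fast. A small sketch, where `BASE_STOP_WORDS` and `EXTRA_STOP_WORDS` are short stand-ins for the get_stop_words('en') result and the hand-copied list, and `remove_stop_words` is a hypothetical helper:

```python
# Stand-in for the list returned by get_stop_words('en').
BASE_STOP_WORDS = ["a", "about", "the", "of"]
# Stand-in for the manually copied list; it overlaps with the base list.
EXTRA_STOP_WORDS = ["one", "a", "about"]

# Set union keeps each stop word exactly once.
stop_words = set(BASE_STOP_WORDS) | set(EXTRA_STOP_WORDS)


def remove_stop_words(word_list):
    """Keep only the words that are not in the combined stop-word set."""
    return [word for word in word_list if word not in stop_words]


kept = remove_stop_words(["one", "of", "the", "angels"])
print(kept)
```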

matthewruehle · 2017-10-16T02:44:51Z

text_mining.py

+
+    ordered_by_frequency = sorted(word_counts, key=word_counts.get, reverse=True)
+
+    return ordered_by_frequency[0:n]


FWIW, you can shorten [0:n] to [:n] and the 0 is implied. Your thing still works, though - just a personal preference thing, really.
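A quick check that the two slices agree, using a made-up `word_counts` dictionary. The last line shows collections.Counter.most_common as another standard-library option; that is a suggestion here, not something the PR uses.

```python
from collections import Counter

# Made-up counts for illustration only.
word_counts = {"thy": 5, "thou": 4, "god": 3, "earth": 1}
n = 2

ordered_by_frequency = sorted(word_counts, key=word_counts.get, reverse=True)

top_explicit = ordered_by_frequency[0:n]  # start index written out
top_implicit = ordered_by_frequency[:n]   # start index implied

# Counter.most_common(n) returns (word, count) pairs, most frequent first.
top_counter = [word for word, _ in Counter(word_counts).most_common(n)]

print(top_explicit, top_implicit, top_counter)
```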

matthewruehle · 2017-10-16T02:46:33Z

text_mining.py

+        # print the sentiment of the top n words
+        print('Sentiment of Top %d Words in %s:' % (n, title))
+        print(sentiment_analyzer(top_n_words), '\n')
+        # print the sentiment of the whole text


This might be a little too much commenting - print statements like these mostly stand on their own.

Though, a bit too much documentation is better than a bit too little!

matthewruehle · 2017-10-16T02:47:38Z

text_mining_tfidf.py

+from textblob import TextBlob as tb         # pip install textblob
+
+
+def get_cache(url, file_name):


One thing to look into: you can import your own functions in Python (e.g., "from text_mining import get_cache"), which would save you some repeated code!
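A self-contained sketch of how that works. In the real project the import would simply be "from text_mining import get_cache" at the top of text_mining_tfidf.py; here a throwaway module named `text_mining_demo` (a hypothetical name) is written to a temporary directory so the example runs on its own. Importing a module executes its top-level code, which is why the demo module keeps its script logic under an `if __name__ == "__main__":` guard.

```python
import importlib
import os
import sys
import tempfile

# Source of a throwaway module standing in for text_mining.py.
MODULE_SOURCE = '''
def get_cache(url, file_name):
    return "text for " + file_name

if __name__ == "__main__":
    # Skipped on import; runs only when the file is executed directly.
    print("running as a script")
'''

with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "text_mining_demo.py"), "w") as f:
        f.write(MODULE_SOURCE)
    sys.path.insert(0, tmp)  # make the temporary directory importable
    try:
        importlib.invalidate_caches()
        module = importlib.import_module("text_mining_demo")
        result = module.get_cache("http://example.invalid", "book.txt")
    finally:
        sys.path.remove(tmp)

print(result)
```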

vivienyuwenchen added 12 commits October 8, 2017 19:39

Word clouds for text mining

608306f

Complete text mining with word frequency analysis

e346243

Complete text mining with TF-IDF

b472a98

Text files for text mining

5ada9fc

Create Project Writeup and Reflection.md

b5d1dfc

Update README.md

b945776

Update with inline comments

3f4dc7e

Merge branch 'master' of https://github.com/vivienyuwenchen/TextMining

10a2277

Update Project Writeup and Reflection.md

71c8a53

Update Project Writeup and Reflection.md

ee3779b

Delete The_Romance_of_Lust_tfidf.png

6ea3f68

Delete The_Romance_of_Lust_wf.png

295f8b5

matthewruehle reviewed Oct 16, 2017


vivienyuwenchen added 2 commits October 18, 2017 23:18

Revise syntax and add import feature

0fdb681

Merge branch 'master' of https://github.com/vivienyuwenchen/TextMining

c3f7cd5

vivienyuwenchen wants to merge 14 commits into sd17fall:master from vivienyuwenchen:master

2 participants

