Better training data – Natural Language Processing With Python and NLTK p.18

After some consideration, it became clear that a new dataset would solve a lot of problems. This tutorial covers switching to that new dataset and what the change involves.

This time, we’re using a movie reviews dataset that contains much shorter reviews.

You can get this data set from: http://pythonprogramming.net/static/downloads/short_reviews/

This one yields a far more reliable reading across the board, and is far better suited to the tweets we intend to read from the…
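
A minimal sketch of loading and labeling the two review files, assuming positive.txt and negative.txt live in a short_reviews/ folder next to the script and are latin-1 encoded (several commenters below found that encoding avoids the decode errors they hit):

    from nltk.tokenize import word_tokenize

    # Assumption: one review per line in each file; latin-1 avoids the
    # UnicodeDecodeError reported in the comments below.
    short_pos = open("short_reviews/positive.txt", "r", encoding="latin-1").read()
    short_neg = open("short_reviews/negative.txt", "r", encoding="latin-1").read()

    documents = []
    for r in short_pos.split('\n'):
        documents.append((r, "pos"))
    for r in short_neg.split('\n'):
        documents.append((r, "neg"))

    # Collect every word (lowercased) for the frequency distribution step.
    all_words = [w.lower() for w in word_tokenize(short_pos) + word_tokenize(short_neg)]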

47 thoughts on “Better training data – Natural Language Processing With Python and NLTK p.18”

  1. I got this error:

    Traceback (most recent call last):
    File "E:ubfinal projectscriptsentiment analysistext classification.py", line 123, in <module>
    print ("MNB_classifier accuracy percent", (nltk.classify.accuracy (MNB_classifier, testing_set))*100)
    File "C:python36libsite-packagesnltkclassifyutil.py", line 87, in accuracy
    results = classifier.classify_many([fs for (fs, l) in gold])
    File "C:python36libsite-packagesnltkclassifyscikitlearn.py", line 85, in classify_many
    X = self._vectorizer.transform(featuresets)
    File "C:python36libsite-packagessklearnfeature_extractiondict_vectorizer.py", line 291, in transform
    return self._transform(X, fitting=False)
    File "C:python36libsite-packagessklearnfeature_extractiondict_vectorizer.py", line 183, in _transform
    raise ValueError("Sample sequence X is empty.")
    ValueError: Sample sequence X is empty.

    Anybody know how I can solve this problem? Thank you.
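
    This ValueError usually means testing_set came out empty: if featuresets has fewer entries than the index used to split it, the testing slice is an empty list and the vectorizer has nothing to transform. A quick sanity check, sketched with the tutorial's variable names:

    print(len(featuresets))               # if this is under 10000, a [10000:] slice is empty
    split = int(len(featuresets) * 0.9)   # a proportional 90/10 split can't come out empty
    training_set = featuresets[:split]
    testing_set = featuresets[split:]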

    I saved two files, positive.txt and negative.txt, in a folder short_reviews in the same directory where Scripts is saved.
    Still I am getting the same error:
    FileNotFoundError: No such file or directory: 'short_reviews/positive.txt'

    Will you please help me with the saving path?
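
    A relative path like "short_reviews/positive.txt" is resolved against the current working directory, not the script's folder, so it fails whenever the script is launched from somewhere else. One workaround, sketched on the assumption that short_reviews sits next to the script:

    import os

    # Build an absolute path anchored at this script's own location, so it
    # works no matter which directory Python was launched from.
    base_dir = os.path.dirname(os.path.abspath(__file__))
    pos_path = os.path.join(base_dir, "short_reviews", "positive.txt")
    short_pos = open(pos_path, "r", encoding="latin-1").read()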

  3. Hey, sentdex. I want to ask you. When i try to run your code, i got this error :

    Traceback (most recent call last):
    File "C:UsersSofianDocumentsNLTK PythonBetter Training Data.py", line 92, in <module>
    #GaussianNB_classifier = SklearnClassifier(GaussianNB())
    File "C:UsersSofianAppDataLocalProgramsPythonPython36-32libsite-packagesnltkclassifyscikitlearn.py", line 117, in train
    X = self._vectorizer.fit_transform(X)
    File "C:UsersSofianAppDataLocalProgramsPythonPython36-32libsite-packagessklearnfeature_extractiondict_vectorizer.py", line 230, in fit_transform
    return self._transform(X, fitting=True)
    File "C:UsersSofianAppDataLocalProgramsPythonPython36-32libsite-packagessklearnfeature_extractiondict_vectorizer.py", line 172, in _transform
    values.append(dtype(v))
    MemoryError

    What is the meaning of this error and how do I fix it? Thank you for any help 🙂

  4. Shouldn't we be shuffling the documents here instead of the featuresets?
    Taking the first 5000 of all_words would rather mean mostly positive words, since we appended short_pos_words first.
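
    That concern is reasonable: keys() order on a FreqDist isn't guaranteed to be by frequency, so a safer cut takes the most common entries explicitly, and shuffling documents before featurizing keeps the split mixed either way. A sketch using the tutorial's names:

    import random
    import nltk

    random.shuffle(documents)  # shuffle the raw documents before featurizing

    all_words_fd = nltk.FreqDist(all_words)
    # most_common() counts across BOTH classes, so the 5000-word vocabulary
    # isn't biased toward whichever file's words were appended first.
    word_features = [w for (w, count) in all_words_fd.most_common(5000)]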

  5. My accuracy comes out to 50-55%, no matter how many times I run it. I am using the IMDb dataset. How do I improve the accuracy?
    Moreover, NuSVC is also not working; the shell restarts after some time whenever I run it. What should I do?

  6. Good day, could anyone please help me? I ran into an error. I am using Spyder on macOS, and whenever I try to run the code you have provided, it shows:

    runfile('/Users/Darkhan/python', wdir='/Users/Darkhan')
    Traceback (most recent call last):
    File "<ipython-input-28-c514b7b79643>", line 1, in <module>
     runfile('/Users/Darkhan/python', wdir='/Users/Darkhan')
    File "/Users/Darkhan/anaconda/lib/python3.6/site-packages/spyder/utils/site/sitecustomize.py", line 880, in runfile
     execfile(filename, namespace)
    File "/Users/Darkhan/anaconda/lib/python3.6/site-packages/spyder/utils/site/sitecustomize.py", line 102, in execfile
     exec(compile(f.read(), filename, 'exec'), namespace)
    File "/Users/Darkhan/python", line 53, in <module>
     readMe = open('text.txt','r').read()
    File "/Users/Darkhan/anaconda/lib/python3.6/encodings/ascii.py", line 26, in decode
     return codecs.ascii_decode(input, self.errors)[0]

    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 3118: ordinal not in range(128)

    I am doing everything as you did, but this new dataset can't be read by Python. I think it is due to the size of the .txt file, because when I try a smaller set of data it reads fine. What would you suggest? I searched the net and could not find any proper solution.
    Thanks!

  7. I think we should also lowercase all the words when categorizing the data:

    for r in short_pos.split('\n'):
        documents.append((r.lower(), "pos"))

    for r in short_neg.split('\n'):
        documents.append((r.lower(), "neg"))

  8. This is the error I'm getting, any clue why?

    File "C:/Users/Sree Hari/Desktop/New folder/nlp/algosent.py", line 94, in <module>
    MNB_classifier.train(training_set)
    File "C:python3libsite-packagesnltkclassifyscikitlearn.py", line 117, in train
    X = self._vectorizer.fit_transform(X)
    File "C:python3libsite-packagessklearnfeature_extractiondict_vectorizer.py", line 231, in fit_transform
    return self._transform(X, fitting=True)
    File "C:python3libsite-packagessklearnfeature_extractiondict_vectorizer.py", line 173, in _transform
    values.append(dtype(v))
    MemoryError
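
    A MemoryError inside fit_transform (several commenters hit this) means the feature matrix no longer fits in the interpreter's memory, which is especially common on 32-bit Python builds. Moving to 64-bit Python helps; so does shrinking the vocabulary, sketched here with the tutorial's all_words list:

    import nltk

    # Fewer boolean features per document means a smaller matrix for the
    # vectorizer; 3000 words is usually still plenty for this dataset.
    word_features = [w for (w, c) in nltk.FreqDist(all_words).most_common(3000)]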

  9. I am getting an error stating AttributeError: 'NoneType' object has no attribute 'append' when creating the documents list and the all_words variable. Could you please help me rectify this error?
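
    list.append() returns None, so that AttributeError usually means the list was reassigned to the result of an append call somewhere. A sketch of the likely bug and its fix (the reassignment shown is the hypothesized mistake):

    documents = []
    for r in short_pos.split('\n'):
        # Wrong: documents = documents.append((r, "pos"))  # rebinds documents to None
        documents.append((r, "pos"))  # right: call append for its side effect only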

  10. Hi Harrison,
    I have a few questions about training data: do you know how negative and positive reviews are generally tagged? Is it based on star rating? If I wanted to do a similar analysis on a different domain, e.g. laptop reviews, would it be sufficient to categorize reviews with 1-2 stars as negative, 3 stars as neutral, and 4-5 stars as positive and use that as my training set? What would be your approach? Thanks
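
    For reference, sentiment corpora are often labeled in exactly that way, from star ratings or reviewer scores, and the threshold scheme described (1-2 negative, 3 neutral, 4-5 positive) is a common choice. A sketch of that mapping, where laptop_reviews is a hypothetical list of (text, stars) pairs:

    def stars_to_label(stars):
        # Hypothetical thresholds: tune these for your own domain.
        if stars <= 2:
            return "neg"
        if stars == 3:
            return "neutral"
        return "pos"

    documents = [(text, stars_to_label(s)) for (text, s) in laptop_reviews]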

  11. This is a great video series teaching how to use NLTK with supervised learning. Do you have any videos on using NLTK with unsupervised learning?
    Ex: text mining newspaper articles and clustering articles based on their similarities as a way to recommend a bunch of news articles covering a person's topic of interest.

  12. Just tried running the copied code with the downloaded files, and I get the following error:
    Traceback (most recent call last):
    File "/Users/imcnabb/Updated/Data Science/PYTHON/Intermediate Python/training_data.py", line 37, in <module>
    for r in short_pos.split('\n'):
    TypeError: a bytes-like object is required, not 'str'

    How can I fix this?
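
    That error means short_pos was read as bytes (the file was opened in "rb" mode) while split() was given a str separator. Either decode the bytes or open the file in text mode with an explicit encoding; a sketch of both options:

    # Option 1: text mode with an explicit encoding
    short_pos = open("short_reviews/positive.txt", "r", encoding="latin-1").read()

    # Option 2: binary mode, decode before splitting
    short_pos = open("short_reviews/positive.txt", "rb").read().decode("latin-1")

    for r in short_pos.split('\n'):
        documents.append((r, "pos"))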

  13. I have this error... please help me:

    Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    File "C:\Users\Jeet\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 866, in runfile
    execfile(filename, namespace)
    File "C:\Users\Jeet\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)
    File "D:/PROJECTS/Sentiment Analysis/nltkvid18.py", line 96, in <module>
    print("MNB_classifier accuracy percent:", (nltk.classify.accuracy(MNB_classifier, testing_set))*100)
    File "C:\Users\Jeet\Anaconda3\lib\site-packages\nltk\classify\util.py", line 87, in accuracy
    results = classifier.classify_many([fs for (fs, l) in gold])
    File "C:\Users\Jeet\Anaconda3\lib\site-packages\nltk\classify\scikitlearn.py", line 83, in classify_many
    X = self._vectorizer.transform(featuresets)
    File "C:\Users\Jeet\Anaconda3\lib\site-packages\sklearn\feature_extraction\dict_vectorizer.py", line 293, in transform
    return self._transform(X, fitting=False)
    File "C:\Users\Jeet\Anaconda3\lib\site-packages\sklearn\feature_extraction\dict_vectorizer.py", line 184, in _transform
    raise ValueError("Sample sequence X is empty.")
    ValueError: Sample sequence X is empty.

  14. from nltk.corpus import PlaintextCorpusReader
    corpusdir = 'short_reviews/'
    short_reviews = PlaintextCorpusReader(corpusdir, '.*')

    short_pos_words = word_tokenize(short_pos.decode("latin-1"))
    short_neg_words = word_tokenize(short_neg.decode("latin-1"))

    worked for me for solving the 0xf3 encoding error

  15. I have an error, please help me:

    Traceback (most recent call last):
    File "C:Pythonsentiment18.py", line 97, in <module>
    MNB_Classifier.train(training_set)
    File "C:Usersjhon anthonyAppDataLocalProgramsPythonPython35-32libsite-packagesnltkclassifyscikitlearn.py", line 115, in train
    X = self._vectorizer.fit_transform(X)
    File "C:Usersjhon anthonyAppDataLocalProgramsPythonPython35-32libsite-packagessklearnfeature_extractiondict_vectorizer.py", line 231, in fit_transform
    return self._transform(X, fitting=True)
    File "C:Usersjhon anthonyAppDataLocalProgramsPythonPython35-32libsite-packagessklearnfeature_extractiondict_vectorizer.py", line 173, in _transform
    values.append(dtype(v))
    MemoryError

  16. Anyone else getting the following error:

    Traceback (most recent call last):
    File "/Users/jgarcia/Desktop/Python Tutorial Scripts/Text Classification.py", line 102, in <module>
    MNB_classifier.train(training_set)
    File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/classify/scikitlearn.py", line 114, in train
    X, y = list(compat.izip(*labeled_featuresets))
    ValueError: need more than 0 values to unpack

  17. Traceback (most recent call last):
    File "/Users/Tarundeep/Desktop/nltk/nltk_better_training_data.py", line 38, in <module>
    short_pos = open("short_reviews/positive.txt", "r").read()
    File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf3 in position 4645: invalid continuation byte

    Process finished with exit code 1

  18. Here, rev in featuresets holds dictionary-like values, with words as keys, alongside the polarity label. Suppose I have a test dataset that contains only reviews: how do I train the model on the training dataset and then have it predict a category for each test review? If I try to train on the featureset together with the category, it is not accepted, since the rev column has a dictionary-like value. Your example shows only accuracy. How do I use a predict function following the same procedure?
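
    To predict on unlabeled reviews, run the same feature extractor on the raw text and pass the resulting dict, without any label, straight to classify(). A sketch, assuming the tutorial's find_features() function, a trained MNB_classifier, and a hypothetical unlabeled_reviews list of strings:

    for review in unlabeled_reviews:
        feats = find_features(review)  # same {word: True/False} dict used in training
        print(review[:50], "->", MNB_classifier.classify(feats))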

  19. I'm getting this error: UnicodeDecodeError:
    'ascii' codec can't decode byte 0xc3 in position 6: ordinal not in range(128)

    When I print short_pos[:100] I get output like this:
    '\xef\xbb\xbfthe rock is destined to be the 21st century's new " conan " and that he's going to make a splash '

    Using this solution:
    from unidecode import unidecode
    text = unidecode(text)
    still gives the same error; maybe it's not able to decode the string using the correct source encoding.
    Saving the .txt file in MS Word does not help either.

    Can you please tell me the correct source encoding of both the positive.txt and negative.txt files, so that I can decode them using
    unicode_str = short_pos.decode(<source encoding of the string>)

    and then encode using encoded_str = unicode_str.encode("utf8")
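
    On the encoding question: the leading \xef\xbb\xbf shown above is a UTF-8 byte-order mark, yet the "invalid continuation byte" traceback elsewhere on this page shows the files are not valid UTF-8 throughout, so a permissive single-byte codec is the practical choice. A sketch for Python 3 (in Python 2, io.open accepts the same encoding argument):

    # latin-1 maps every possible byte to a code point, so it never raises;
    # if fidelity matters more, try encoding="utf-8", errors="replace".
    short_pos = open("short_reviews/positive.txt", "r", encoding="latin-1").read()
    short_neg = open("short_reviews/negative.txt", "r", encoding="latin-1").read()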

  20. When I executed the code in this video, I got an error.
    ___________________________________________________________________________________________________________________

    MNB_classifier.train(training_set)
    File "/usr/local/lib/python3.4/dist-packages/nltk/classify/scikitlearn.py", line 115, in train
    X = self._vectorizer.fit_transform(X)
    File "/usr/local/lib/python3.4/dist-packages/sklearn/feature_extraction/dict_vectorizer.py", line 226, in fit_transform
    return self._transform(X, fitting=True)
    File "/usr/local/lib/python3.4/dist-packages/sklearn/feature_extraction/dict_vectorizer.py", line 195, in _transform
    result_matrix = result_matrix[:, map_index]
    File "/usr/local/lib/python3.4/dist-packages/scipy/sparse/csr.py", line 292, in _getitem_
    return sliced * P
    File "/usr/local/lib/python3.4/dist-packages/scipy/sparse/base.py", line 319, in _mul_
    return self._mul_sparse_matrix(other)
    File "/usr/local/lib/python3.4/dist-packages/scipy/sparse/compressed.py", line 499, in _mul_sparse_matrix
    data = np.empty(nnz, dtype=upcast(self.dtype, other.dtype))
    MemoryError

    ________________________________________________________________________________________________________________________

    Anyone, please help me!

  21. I don't understand why it gives me this error: Traceback (most recent call last):
      File "C:\Python27\NLTK\NLTK BETTER TRAINING DATA.py", line 38, in <module>
        short_pos = open("short_reviews/positive.txt","r").read()
    IOError: [Errno 2] No such file or directory: 'short_reviews/positive.txt'
    I saved both files as positive.txt and negative.txt on the desktop, but it doesn't find them!

  22. I have everything the same as you do, but for some reason I am experiencing the error given below.

    **********************************************************************************************************************
    Traceback (most recent call last):
    File "new.py", line 53, in <module>
    short_pos_words = word_tokenize(short_pos)
    File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/__init__.py", line 104, in word_tokenize
    return [token for sent in sent_tokenize(text, language)
    File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/__init__.py", line 89, in sent_tokenize
    return tokenizer.tokenize(text)
    File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1226, in tokenize
    return list(self.sentences_from_text(text, realign_boundaries))
    File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1274, in sentences_from_text
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
    File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1265, in span_tokenize
    return [(sl.start, sl.stop) for sl in slices]
    File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1304, in _realign_boundaries
    for sl1, sl2 in _pair_iter(slices):
    File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 311, in _pair_iter
    for el in it:
    File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1280, in _slices_from_text
    if self.text_contains_sentbreak(context):
    File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1325, in text_contains_sentbreak
    for t in self._annotate_tokens(self._tokenize_words(text)):
    File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1460, in _annotate_second_pass
    for t1, t2 in _pair_iter(tokens):
    File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 310, in _pair_iter
    prev = next(it)
    File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 577, in _annotate_first_pass
    for aug_tok in tokens:
    File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 542, in _tokenize_words
    for line in plaintext.split('\n'):
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xed in position 6: ordinal not in range(128)

    ******************************************************************************************************************
    Please help me solve this error.

  23. Hi,

    I am using python 3.4 on ubuntu 14.04.

    I have everything the same as you do, but for some reason I am experiencing this error, which asks for the pickled file (or at least I think that is the case) even when I have commented out the lines of code that use the pickled file.

    here is the error:
    /usr/bin/python3.4 /home/ignuz/Desktop/sentiment/test.py
    Traceback (most recent call last):
    File "/home/ignuz/Desktop/sentiment/test.py", line 48, in <module>
    short_positive_words = word_tokenize(short_positive)
    File "/usr/local/lib/python3.4/dist-packages/nltk/tokenize/__init__.py", line 104, in word_tokenize
    return [token for sent in sent_tokenize(text, language)
    File "/usr/local/lib/python3.4/dist-packages/nltk/tokenize/__init__.py", line 88, in sent_tokenize
    tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
    File "/usr/local/lib/python3.4/dist-packages/nltk/data.py", line 796, in load
    opened_resource = _open(resource_url)
    File "/usr/local/lib/python3.4/dist-packages/nltk/data.py", line 914, in _open
    return find(path_, path + ['']).open()
    File "/usr/local/lib/python3.4/dist-packages/nltk/data.py", line 636, in find
    raise LookupError(resource_not_found)
    LookupError:
    ********************************************************************
    Resource 'tokenizers/punkt/PY3/english.pickle' not found.
    Please use the NLTK Downloader to obtain the resource: >>>
    nltk.download()
    Searched in:
    – '/home/ignuz/nltk_data'
    – '/usr/share/nltk_data'
    – '/usr/local/share/nltk_data'
    – '/usr/lib/nltk_data'
    – '/usr/local/lib/nltk_data'
    – ''
    ********************************************************************

    If you could let me know what is wrong, I would be grateful!

    Thanks
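
    The traceback itself points at the fix: the punkt tokenizer models haven't been downloaded, and commenting out the classifier-pickle code doesn't change that, because word_tokenize() needs its own pickled sentence tokenizer from nltk_data. A one-time download resolves it:

    import nltk
    nltk.download('punkt')   # fetches tokenizers/punkt, used by word_tokenize()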
