

Tokenization and pre-processing for Twitter data used to train classifiers.


Statistics on tweetokenize

Number of watchers on GitHub: 56
Number of open issues: 2
Average time to close an issue: about 8 hours
Main language: Python
Open pull requests: 1+
Closed pull requests: 1+
Last commit: about 6 years ago
Repo created: over 6 years ago
Repo last updated: over 1 year ago
Size: 192 KB
Organization / Author: jaredks

Regular-expression-based tokenizer for Twitter, focused on tokenization and pre-processing to train classifiers for sentiment, emotion, or mood.

Intended as glue between Python wrappers for the Twitter API and the machine learning algorithms of the Natural Language Toolkit (NLTK), but probably applicable to tokenizing any short message of the social networking variety.

>>> from tweetokenize import Tokenizer
>>> gettokens = Tokenizer()
>>> gettokens.tokenize('hey playa!:):3.....@SHAQ can you still dunk?#oldLOL')
[u'hey', u'playa', u'!', u':)', u':3', u'...', u'USERNAME', u'can', u'you', u'still', u'dunk', u'?', u'#old', u'', u'', u'', u'LOL']
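
A token list like the one above can be turned into the feature dictionaries that NLTK classifiers are trained on. The following is a minimal sketch, assuming a plain bag-of-words representation; the `token_features` helper, the `contains(...)` key naming, and the `'positive'` label are illustrative assumptions, not part of tweetokenize.

```python
def token_features(tokens):
    """Map a token list to a bag-of-words feature dict (hypothetical helper)."""
    # Skip empty strings; every other token becomes a boolean presence feature.
    return {"contains({})".format(tok): True for tok in tokens if tok}


# Tokens copied from the example output, minus the empty strings.
tokens = ['hey', 'playa', '!', ':)', ':3', '...', 'USERNAME',
          'can', 'you', 'still', 'dunk', '?', '#old', 'LOL']
features = token_features(tokens)

# NLTK classifiers train on (featureset, label) pairs.
# The label here is made up purely for illustration.
labeled = [(features, 'positive')]
```

A list shaped like `labeled`, built from many tweets, is the kind of input that `nltk.NaiveBayesClassifier.train` accepts.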


  • Replaces tweet features such as usernames, URLs, phone numbers, and times with placeholder tokens, reducing feature set complexity and improving classifier performance
  • Allows user-defined sets of emoticons to be used in tokenization
  • Correctly separates consecutively written emoji into individual tokens
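
To make the features above concrete, here is a rough sketch of regex-based tokenization with placeholder replacement. It is not tweetokenize's implementation: the pattern, the `REPLACEMENTS` table, and the `sketch_tokenize` function are hypothetical, and the emoticon set is deliberately tiny (it will, for example, misread the `:3` inside a time such as `10:30`).

```python
import re

# Alternatives are tried left to right, so URLs and usernames are matched
# before the generic word and single-character branches.
TOKEN_RE = re.compile(r"""
    (?P<url>https?://\S+)                # web addresses
  | (?P<username>@\w+)                   # @mentions
  | (?P<emoticon>[:;=][-']?[)(DPp3])     # a small, illustrative emoticon set
  | (?P<hashtag>\#\w+)                   # hashtags
  | (?P<ellipsis>\.{2,})                 # runs of two or more dots
  | (?P<word>\w+)                        # ordinary words and numbers
  | (?P<other>[^\w\s])                   # any other single character, e.g. one emoji
""", re.VERBOSE)

# Token kinds that are collapsed to a fixed placeholder.
REPLACEMENTS = {"url": "URL", "username": "USERNAME", "ellipsis": "..."}


def sketch_tokenize(text):
    """Split text into tokens, replacing some token kinds with placeholders."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        kind = match.lastgroup  # name of the alternative that matched
        tokens.append(REPLACEMENTS.get(kind, match.group()))
    return tokens


result = sketch_tokenize('hey playa!:):3.....@SHAQ can you still dunk?')
```

Because every username collapses to `USERNAME` and every URL to `URL`, a classifier sees one feature per kind rather than one per distinct handle or link. Consecutive emoji fall into the single-character branch, so each becomes its own token.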


python setup.py install

After installation, you can make sure everything is working by running the following inside the project root folder:

python setup.py test



Modified BSD License. See LICENSE for details. Copyright Jared Suttles, 2013.
