gensim

Topic Modelling for Humans



Statistics on gensim

Number of watchers on Github 6492
Number of open issues 166
Average time to close an issue 4 days
Main language Python
Average time to merge a PR 5 days
Open pull requests 137+
Closed pull requests 82+
Last commit over 1 year ago
Repo Created over 8 years ago
Repo Last Updated over 1 year ago
Size 60.6 MB
Organization / Author rare-technologies
Latest Release 3.4.0
Contributors 113

gensim Topic Modelling in Python


Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. The target audience is the natural language processing (NLP) and information retrieval (IR) community.

Features

  • All algorithms are memory-independent w.r.t. the corpus size (can process input larger than RAM, streamed, out-of-core).
  • Intuitive interfaces
    • easy to plug in your own input corpus/datastream (trivial streaming API)
    • easy to extend with other Vector Space algorithms (trivial transformation API)
  • Efficient multicore implementations of popular algorithms, such as online Latent Semantic Analysis (LSA/LSI/SVD), Latent Dirichlet Allocation (LDA), Random Projections (RP), Hierarchical Dirichlet Process (HDP) or word2vec deep learning.
  • Distributed computing: can run Latent Semantic Analysis and Latent Dirichlet Allocation on a cluster of computers.
  • Extensive documentation and Jupyter Notebook tutorials.

If this feature list left you scratching your head, you can first read more about the Vector Space Model and unsupervised document analysis on Wikipedia.
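As a rough illustration of the Vector Space Model mentioned above, the sketch below represents each document as a sparse bag-of-words vector and ranks documents by cosine similarity to a query. It uses only the standard library and is not gensim's API; the helper names `bag_of_words` and `cosine` and the toy documents are assumptions made for this example.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Map a document to a sparse vector of {token: count}."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse {token: weight} vectors."""
    dot = sum(weight * v.get(token, 0) for token, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_u * norm_v)


# Toy corpus (hypothetical): every document becomes one vector.
docs = [
    "human machine interface for computer applications",
    "a survey of user opinion of computer system response time",
    "graph minors a survey",
]
vectors = [bag_of_words(d) for d in docs]

# Rank document indices by similarity to the query, best match first.
query = bag_of_words("computer survey")
ranking = sorted(range(len(docs)),
                 key=lambda i: cosine(query, vectors[i]),
                 reverse=True)
print(ranking)
```

Algorithms such as LSA or LDA start from vectors like these and transform them into a lower-dimensional topic space before comparing documents.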

Support

Please raise potential bugs on GitHub. See the Contribution Guide prior to raising an issue.

If you have an open-ended or a research question:

Installation

This software depends on NumPy and SciPy, two Python packages for scientific computing. You must have them installed prior to installing gensim.

It is also recommended you install a fast BLAS library before installing NumPy. This is optional, but using an optimized BLAS such as ATLAS or OpenBLAS is known to improve performance by as much as an order of magnitude. On OS X, NumPy picks up the BLAS that comes with it automatically, so you don't need to do anything special.

The simple way to install gensim is:

pip install -U gensim

Or, if you have instead downloaded and unzipped the source tar.gz package, you'd run:

python setup.py test
python setup.py install

For alternative modes of installation (without root privileges, development installation, optional install features), see the documentation.

This version has been tested under Python 2.7, 3.5 and 3.6. Gensim's GitHub repo is hooked up to Travis CI for automated testing on every commit push and pull request. Support for Python 2.6, 3.3 and 3.4 was dropped in gensim 1.0.0; install gensim 0.13.4 if you must use Python 2.6, 3.3 or 3.4. Support for Python 2.5 was dropped in gensim 0.10.0; install gensim 0.9.1 if you must use Python 2.5.
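The version rules in the paragraph above can be written down as a small lookup. This is only a sketch of that paragraph: the function name `suggested_gensim_requirement` is made up for this example, and interpreters the paragraph does not mention are deliberately reported as unknown rather than guessed.

```python
import sys


def suggested_gensim_requirement(version_info=None):
    """Return a pip requirement string for the given (major, minor) Python version.

    Encodes only the versions named in the installation notes; returns None
    for any interpreter those notes do not cover.
    """
    if version_info is None:
        version_info = sys.version_info
    major_minor = tuple(version_info[:2])

    if major_minor == (2, 5):
        return "gensim==0.9.1"   # last release line supporting Python 2.5
    if major_minor in {(2, 6), (3, 3), (3, 4)}:
        return "gensim==0.13.4"  # support dropped in gensim 1.0.0
    if major_minor in {(2, 7), (3, 5), (3, 6)}:
        return "gensim"          # versions this release was tested under
    return None                  # not covered by the notes above


print(suggested_gensim_requirement((3, 4)))
```

The returned string can be passed to `pip install` as-is, for example `pip install gensim==0.13.4` on Python 3.4.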

How come gensim is so fast and memory efficient? Isn't it pure Python, and isn't Python slow and greedy?

Many scientific algorithms can be expressed in terms of large matrix operations (see the BLAS note above). Gensim taps into these low-level BLAS libraries, by means of its dependency on NumPy. So while gensim-the-top-level-code is pure Python, it actually executes highly optimized Fortran/C under the hood, including multithreading (if your BLAS is so configured).

Memory-wise, gensim makes heavy use of Python's built-in generators and iterators for streamed data processing. Memory efficiency was one of gensim's design goals, and is a central feature of gensim, rather than something bolted on as an afterthought.
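A minimal sketch of that streaming idea, using only the standard library: the corpus object below yields one bag-of-words document at a time, so memory use depends on the largest document rather than on the corpus size. The class name `StreamedCorpus` and its `open_stream` argument are hypothetical, and the sketch keeps raw tokens where gensim would map them to integer ids.

```python
import io
from collections import Counter


class StreamedCorpus:
    """Iterate over documents one at a time; the full corpus is never held in memory."""

    def __init__(self, open_stream):
        # open_stream: zero-argument callable returning a fresh iterator of lines,
        # so the corpus can be traversed more than once (one document per line).
        self.open_stream = open_stream

    def __iter__(self):
        for line in self.open_stream():
            tokens = line.lower().split()
            if tokens:  # skip blank lines
                # Yield a sparse bag-of-words: sorted (token, count) pairs.
                yield sorted(Counter(tokens).items())


# An in-memory stream stands in for a large file on disk.
text = "human computer interaction\n\ngraph of trees graph\n"
corpus = StreamedCorpus(lambda: io.StringIO(text))

n_docs = sum(1 for _ in corpus)  # first pass: count documents
first_pass = list(corpus)        # second pass: the stream is reopened
print(n_docs, first_pass)
```

Passing a callable instead of an open file is a design choice for this sketch: many training algorithms need several passes over the data, and a plain generator would be exhausted after the first one.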

Documentation


Adopters

| Name | URL | Description |
|------|-----|-------------|
| RaRe Technologies | rare-technologies.com | Machine learning & NLP consulting and training. Creators and maintainers of Gensim. |
| Mindseye | mindseye.com | Similarities in legal documents |
| Talentpair | talentpair.com | Data science driving high-touch recruiting |
| Tailwind | Tailwindapp.com | Post interesting and relevant content to Pinterest |
| Issuu | Issuu.com | Gensim's LDA module lies at the very core of the analysis we perform on each uploaded publication to figure out what it's all about. |
| Sports Authority | sportsauthority.com | Text mining of customer surveys and social media sources |
| Search Metrics | searchmetrics.com | Gensim word2vec used for entity disambiguation in Search Engine Optimisation |
| Cisco Security | cisco.com | Large-scale fraud detection |
| 12K Research | 12k.co | Document similarity analysis on media articles |
| National Institutes of Health | github/NIHOPA | Processing grants and publications with word2vec |
| Codeq LLC | codeq.com | Document classification with word2vec |
| Mass Cognition | masscognition.com | Topic analysis service for consumer text data and general text data |
| Stillwater Supercomputing | stillwater-sc.com | Document comprehension and association with word2vec |
| Channel 4 | channel4.com | Recommendation engine |
| Amazon | amazon.com | Document similarity |
| SiteGround Hosting | siteground.com | An ensemble search engine which uses different embeddings models and similarities, including word2vec, WMD, and LDA. |
| Juju | www.juju.com | Provide non-obvious related job suggestions. |
| NLPub | nlpub.org | Distributional semantic models including word2vec. |
| Capital One | www.capitalone.com | Topic modeling for customer complaints exploration. |

Citing gensim

When citing gensim in academic papers and theses, please use this BibTeX entry:

@inproceedings{rehurek_lrec,
      title = {{Software Framework for Topic Modelling with Large Corpora}},
      author = {Radim {\v R}eh{\r u}{\v r}ek and Petr Sojka},
      booktitle = {{Proceedings of the LREC 2010 Workshop on New
           Challenges for NLP Frameworks}},
      pages = {45--50},
      year = 2010,
      month = May,
      day = 22,
      publisher = {ELRA},
      address = {Valletta, Malta},
      note={\url{http://is.muni.cz/publication/884893/en}},
      language={English}
}
gensim open issues
  • over 2 years Loss through each iteration in skip gram
  • over 2 years Dynamic-NMF in gensim
  • over 2 years Consider dropping Python 2.6 (and maybe 3.3) support/CI-testing
  • over 2 years TestWikiCorpus intermittent test hangs
  • over 2 years Add automatic PEP8 checking to Travis
  • over 2 years CBOW model equivalent to the supervised learning model of fastText
  • over 2 years Directory names on Windows 10 failing. ( '/tmp' )
  • over 2 years Add more tests for HdpModel
  • over 2 years HDP improvements: Topic Hierarchy and Tutorial
  • over 2 years HdpModel does not document inference using []
  • over 2 years The automated tests for notebooks from #874 do not run during CI
  • over 2 years Support Pyro 4.47 in LDA and LSI distributed
  • over 2 years Modifying train_cbow_pair
  • over 2 years Distributed LDA "ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()"
  • over 2 years per_word_topics for a corpus
  • over 2 years Documentation overhaul
  • over 2 years Document Influence Model
  • over 2 years hdp_to_lda should return LdaModel instance
  • over 2 years Segmentation fault when creating a Word2Vec model.
  • over 2 years Adding a conda-forge recipe for Gensim
  • over 2 years Update wikipedia tutorial
  • almost 3 years Adding Word Prediction
  • almost 3 years DTM ldaseqmodel tutorial notebook - Corpus and cleaning instructions
  • almost 3 years Unpickling models across python3 and python2
  • almost 3 years Improvements to Dynamic Topic Models
  • almost 3 years trim_rule doesn't work if model initialization and vocabulary building are separated
  • almost 3 years Inferring doc topics from loaded LdaMallet fails due to missing corpus .mallet file
  • almost 3 years WARNING : supplied example count did not equal expected count
  • almost 3 years Doc2Vec defaults shouldn't override Word2Vec defaults without good reason
  • almost 3 years Doc2Vec.DocvecsArray.doctag_syn0norm not ignored in save()
gensim open pull requests
  • [WIP] Distance Metrics Notebook
  • [MRG] Topic Coherence
  • [WIP] DTM sample classes, helper methods.
  • Adopters and Resources Pages
  • Changed keywords.py to make imports work for python2
  • [WIP] Add support for using Annoy as an external similarity index
  • WIP Doc2vec on proteins example in iPython notebook
  • Online word2vec
  • [WIP] Implement save_word2vec_format can for Doc2Vec
  • [WIP] Doc2vec to wikipedia
  • Initial wmd release2
  • Zach fastsent
  • Initial wmd release
  • Flexible API for nearest neighbour search
  • Fastsent
  • Fix RP model loading for large Fortran-order arrays
  • Docker instructions in README.rst
  • Ldamodel cython
  • Example Cython vocabulary count code (depends on external libraries, may complicate build)
  • cbow_mean default changed from 0 to 1.
  • [WIP] Implement Word Mover's Distance (WMD) in Gensim
  • [WIP] Replace frequency counting using dict by a combination of hyperloglog and CountMinSketch
  • Added method to restrict vocab of Word2Vec most similar search
  • Parallel scan_vocab
  • Faster analogies
  • [Word2Vec] Support for reading calculated skip ngrams
  • Match the Annoy index similarity API to the Word2Vec API
  • Distributed lda options
  • minor docfix for word2vec trim_rule
  • Added new tutorial+slides to tutorials.md
  • LDA tutorial, tips and tricks
  • Online word2vec test
  • n_similarity() in word2vec and doc2vec raises ValueError if an empty list is passed
  • Sklearn wrapper integration
  • Add auto-generated docs to gitignore
  • Improve usage instructions for glove2word2vec script.
  • Loading and Saving LDA Models across Python 2 and 3.
  • [WIP] Author-topic model
  • [WIP] Raise warning if calling load() on an instance rather than the class
  • Process cross index queries in parallel
  • [WIP] Doc fixes and dict logging
  • [WIP] Wrapper for FastText
  • Correct logic for iterating over SimilarityABC,
  • [WIP] Phrases optimizations
  • Hmj fastsent
  • KeyedVectors refactor for word2vec
  • [WIP] Add ability to use Tensorflow to train a word2vec model
  • Potential Word Movers Distance performance improvement: WCD and RWMD
  • The end index in list slicing is exclusive
  • Optimising sigmoid function everywhere
  • Word2Vec/Doc2Vec offer model-minimization method Fix issue #446
  • Adding docs with CI using travis-sphinx, Fix #907
  • KeyedVecs refactoring for word2vec
  • Fix PR #963 : Passing all the params through the apply call in lda.get_document_topics
  • Add pep8 to Travis
  • [DNM] PR #929 double-checking
  • Add tests for ipython notebooks to run during Travis CI
  • Resolving issue #908
  • [WIP] Add deployment option for docs
  • [WIP]Adding sklearn wrapper for LDA code
  • Removing Doc2Vec defaults so that it won't override Word2Vec defaults. fix #795
  • Add per_word_topics parameter to apply call
  • Pep8 fixes
  • [WIP] [DNM] Keyedvector load word2vec format
  • Converting notebooks to html for further publishing
  • [WIP] Wrapper for Varembed Models
  • added wordrank wrapper
  • Ignore DocvecsArray.doctag_syn0norm in save
  • [WIP] TensorFlow wrapper for using GPU
  • Open the vector file as utf-8 under Python 3 on Windows.
  • PEP8 Fixes for Summarization.
  • Fixing PR 1326 and providing some tests for unicode wiki corpora
  • Fix issue-1310
  • Add exception to check for morfessor import
  • [WIP] Fix backward incompatibility due to `random_state`
  • Sparse support when num_best not None in interfaces. Fixes #1294
  • [WIP] support both old and new fastText model
  • Add KeyedVectors support to AnnoyIndexer
  • Code style fixes to the TFIDF module
  • Corpus streaming tutorial changes
  • Word2Vec default constants
  • Add common terms phrases model
  • [WIP] Keras wrapper for Word2Vec model in Gensim
  • [WIP][DNM] Visualize topic model difference (need feedback)
  • [WIP] Add new restrict_vocab functionality, most_similar_among
  • [WIP] Add support for PySpaRNN (#1224)
  • [WIP] index object pickle forced to be always under `index.output_prefix` in Similarity
  • [WIP] A framework for distributed Word2Vec using Distributed Tensorflow
  • [WIP] Computes training loss for skip gram model. Fixes issue #999.
  • gensim models show_topic/print_topic parameter num_words changed to topn to match other topic models
  • Added function to replace gensim check_output with subprocess.check_output
  • [WIP] Labeled w2v
  • Refactoring LDAModel's 'top_topics'; now uses CoherenceModel(..)
  • [WIP] Adding docker containers in Gensim. Fix #497
  • [WIP] Deeplearning4j word2vec wrapper stub
  • Fix type issue in `gensim.matutils.unitvec`. Fix #1722
  • Update documentation for gensim.similarities.docsim
  • Store images from README directly in repository. Fix #1849
  • Fix documentation for `gensim.models.wrappers`
  • Fix pure python implementation of doc2vec (w/online-learning). Partial fix #1019
  • Fix bug in `Phrases`. Fix #1401
  • Fix `D2VTransformer.fit_transform`. Fix #1834
  • Fix 1779
  • Refactor API reference gensim.corpora: lowcorpus.py, malletcorpus.py & mmcorpus.py. Fix #1671
  • Implement Soft Cosine Measure
  • [WIP] simple cython version of mmreader
  • Fix datatype parameter for `KeyedVectors.load_word2vec_format`. Fix #1682
  • Add notebook for decomposition using svd and nmf
  • Replace open() with smart_open() in notebooks. Fix #1789
  • Reformat API reference
  • Refactor docstrings for `gensim.scripts`. Fix #1665
  • Pivoted normalization for tfidf model fix #220 [WIP]
  • [WIP] partially cython-ized version of ldamodel
  • Feature: New Author Inference
  • Fix dtype of `matutils.unitvec`. Fix #1722
  • [WIP] Native implementation of sent2vec in gensim
  • [WIP] Topic model visualization
  • Check path to executable for `gensim.utils.check_output`. Fix #1485
  • Fix deprecated parameters in `D2VTransformer` and `W2VTransformer`. Fix #1937
  • Increased default restrict_vocab in accuracy
  • Refactor documentation for `gensim.models.coherencemodel`.
  • [WIP] Topic model visualization
  • Akshay sentence mixture
  • Update word2vec model docstring to numpy-style
  • Fix method `estimate_memory` from `gensim.models.FastText` & huge performance improvement. Fix #1824
  • Addresses #465 : allow initialization with `max_vocab` in lieu of `min_count`
  • Fix docstrings for`gensim.models.hdpmodel`, `gensim.models.lda_worker` & `gensim.models.lda_dispatcher`(#1667)
  • Add docstrings for Author-topic model
  • [DNM][WIP] New site for gensim documentation
  • [DNM] Optimize sparse * random dense matrix multiply in LsiModel
  • Sklearn API docstrings
  • Fix encoding bug in `gensim.scripts.word2vec2tensor`. Fix #1958
  • Fix test for `gensim.models.KeyedVectors.similarity_matrix` method. Fix #1961
  • Adresses #1654 Use Bounter for approx frequency counting
  • Add `gensim.models.BaseKeyedVectors.add_entity` method for fill `KeyedVectors` in manual way. Fix #1942
  • Refactor documentation for `gensim.models.phrases`
  • Refactor documentation for `*2Vec` models
gensim questions on Stackoverflow
  • How to install gensim
  • ImportError: cannot import name corpora with Gensim
  • How to get vocabulary word count from gensim word2vec?
  • Gensim - Doc2Vec learning - Wrong Tags Predicted
  • Using Latent Dirichlet Allocation with Gensim
  • number of vocabulary in gensim is much lower than the ones in training data
  • How to predict the topic of a new query using a trained LDA model using gensim?
  • How to install Gensim version 0.11.1 on Windows 10 Machine?
  • Python gensim installation issue on Mac
  • What does size parameter in gensim doc2vec represent
  • Gensim, Cython and noGIL - something leads to Python console abortion
  • How to implement hlda transformation to find correlation of topics in gensim?
  • Different models with gensim Word2Vec on python
  • gensim with error of GPU memory access
  • How to get the Document Vector from Doc2Vec in gensim 0.11.1?
  • Understanding LDA Transformed Corpus in Gensim
  • pip installation of gensim - 'ascii' codec can't decode byte 0xe2 in position 53
  • Clustering using Latent Dirichlet Allocation algo in gensim
  • Are there any efficient python libraries for Dynamic Topic Models, preferably extending Gensim?
  • Tracking document ids in gensim to get the topic distribution of the source text
  • How to monitor convergence of Gensim LDA model?
  • How to increase Dictionary size in gensim while making Corpus?
  • import error when using gensim to topic model wikipedia
  • How to get word vectors from a gensim Doc2Vec?
  • Negative value interpretation in LSI Corpus generated by Python Gensim?
  • Gensim - Get distance based on cosine similarity
  • Gensim Word2Vec distances are too close
  • How can I run this gensim code? Do I need some text files?
  • How to print out the full distribution of words in an LDA topic in gensim?
  • LDA Gensim Word -> Topic Ids Distribution instead of Topic -> Word Distribution
gensim latest release notes

3.4.0, 2018-03-01

:star2: New features:

  • Massive optimizations of gensim.models.LdaModel: much faster training, using Cython. (@arlenk, #1767)

    • Training benchmark :boom:

    | dataset | old LDA [sec] | optimized LDA [sec] | speed up |
    |---------|---------------|---------------------|---------|
    | nytimes | 3473 | 1975 | 1.76x |
    | enron | 774 | 437 | 1.77x |

    • This change affects all models that depend on LdaModel, such as LdaMulticore, LdaSeqModel, AuthorTopicModel.
  • Huge speed-ups to corpus I/O with MmCorpus (Cython) (@arlenk, #1825)

    • File reading benchmark

    | dataset | file compressed? | old MmReader [sec] | optimized MmReader [sec] | speed up |
    |---------------|:-----------:|:------------:|:------------------:|:-------------:|
    | enron | no | 22.3 | 2.6 | 8.7x |
    | | yes | 37.3 | 14.4 | 2.6x |
    | nytimes | no | 419.3 | 49.2 | 8.5x |
    | | yes | 686.2 | 275.1 | 2.5x |
    | text8 | no | 25.4 | 2.5 | 10.1x |
    | | yes | 41.9 | 17.0 | 2.5x |

    • Overall, a 2.5x speedup for compressed .mm.gz input and 8.5x :fire::fire::fire: for uncompressed plaintext .mm.
  • Performance and memory optimization to gensim.models.FastText :rocket: (@jbaiter, #1916)

    • Benchmark (first 500,000 articles from English Wikipedia)

    | Metric | old FastText | optimized FastText | improvement |
    | -----------------------| -----------------| -------------------|-------------|
    | Training time (1 epoch) | 4823.4s (80.38 minutes) | 1873.6s (31.22 minutes) | 2.57x |
    | Training time (full) | 1h 26min 13s | 36min 43s | 2.35x |
    | Training words/sec | 72,781 | 187,366 | 2.57x |
    | Training peak memory | 5.2 GB | 3.7 GB | 1.4x |

    • Overall, a 2.5x speedup & memory usage reduced by 30%.
  • Implemented Soft Cosine Measure (@Witiko, #1827)

    | Technique | MAP score | Duration |
    |-----------|-----------|--------------|
    | softcossim| 45.99 | 1.24 sec |
    | wmd-relax | 44.48 | 12.22 sec |
    | cossim | 44.22 | 4.39 sec |
    | wmd-gensim| 44.08 | 98.29 sec |
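For readers unfamiliar with the Soft Cosine Measure listed among the new features above, here is a small sketch of the usual textbook definition: the ordinary cosine formula, with every dot product replaced by a bilinear form over a term-similarity matrix. This is not gensim's implementation; the function name `soft_cosine`, the three-word vocabulary and the similarity values are assumptions chosen for illustration.

```python
import math


def soft_cosine(x, y, sim):
    """Soft cosine between dense vectors x and y under term-similarity matrix sim.

    Computes (x^T S y) / (sqrt(x^T S x) * sqrt(y^T S y)).
    """
    def bilinear(a, b):
        return sum(a[i] * sim[i][j] * b[j]
                   for i in range(len(a)) for j in range(len(b)))

    denom = math.sqrt(bilinear(x, x)) * math.sqrt(bilinear(y, y))
    return bilinear(x, y) / denom if denom else 0.0


# Assumed vocabulary order: ["play", "game", "tax"].
# "play" and "game" are given an illustrative similarity of 0.5.
related = [
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
# With the identity matrix the measure reduces to the ordinary cosine.
identity = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

doc_play = [1, 0, 0]  # a document containing only "play"
doc_game = [0, 1, 0]  # a document containing only "game"

plain = soft_cosine(doc_play, doc_game, identity)
soft = soft_cosine(doc_play, doc_game, related)
print(plain, soft)
```

The two documents share no terms, so the plain cosine sees them as unrelated, while the soft variant gives them credit for containing similar terms.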

:+1: Improvements:

  • New method to show the Gensim installation parameters: python -m gensim.scripts.package_info --info. Use this when reporting problems, for easier debugging. Fix #1902 (@sharanry, #1903)
  • Added a flag to optionally skip network-related tests, to help maintainers avoid network issues with CI services (@menshikh-iv, #1930)
  • Added license field to setup.py, allowing the use of tools like pip-licenses (@nils-werner, #1909)

:red_circle: Bug fixes:

  • Fix Python 3 compatibility for gensim.corpora.UciCorpus.save_corpus (@darindf, #1875)
  • Add wv property to KeyedVectors for backward compatibility. Fix #1882 (@manneshiva, #1884)
  • Fix deprecation warning from inspect.getargspec. Fix #1878 (@aneesh-joshi, #1887)
  • Add LabeledSentence to gensim.models.doc2vec for backward compatibility. Fix #1886 (@manneshiva, #1891)
  • Fix empty output bug in Phrases (when using model[tokens] twice). Fix #1401 (@sj29-innovate, #1853)
  • Fix type problems for D2VTransformer.fit_transform. Fix #1834 (@Utkarsh-Mishra-CIC, #1845)
  • Fix datatype parameter for KeyedVectors.load_word2vec_format. Fix #1682 (@pushpankar, #1819)
  • Fix deprecated parameters in doc2vec-lee notebook (@TheFlash10, #1918)
  • Fix file-like closing bug in gensim.corpora.MmCorpus. Fix #1869 (@sj29-innovate, #1911)
  • Fix precision problem in test_similarities.py, no more FP fails. (@menshikh-iv, #1928)
  • Fix encoding in Lee corpus reader. (@menshikh-iv, #1931)
  • Fix OOV pairs counter in WordEmbeddingsKeyedVectors.evaluate_word_pairs. (@akutuzov, #1934)

:books: Tutorial and doc improvements:

:warning: Deprecations (will be removed in the next major release)

  • Remove

    • gensim.models.wrappers.fasttext (obsoleted by the new native gensim.models.fasttext implementation)
    • gensim.examples
    • gensim.nosy
    • gensim.scripts.word2vec_standalone
    • gensim.scripts.make_wiki_lemma
    • gensim.scripts.make_wiki_online
    • gensim.scripts.make_wiki_online_lemma
    • gensim.scripts.make_wiki_online_nodebug
    • gensim.scripts.make_wiki (all of these obsoleted by the new native gensim.scripts.segment_wiki implementation)
    • deprecated functions and attributes
  • Move

    • gensim.scripts.make_wikicorpus → gensim.scripts.make_wiki.py
    • gensim.summarization → gensim.models.summarization
    • gensim.topic_coherence → gensim.models._coherence
    • gensim.utils → gensim.utils.utils (old imports will continue to work)
    • gensim.parsing.* → gensim.utils.text_utils

3.3.0, 2018-02-02

:star2: New features:

  • Re-designed all *2vec implementations (@manneshiva, #1777)

  • Improve gensim.scripts.segment_wiki by retaining interwiki links. Fix #1712 (@steremma, PR #1839)

    • Optionally extract interlinks from Wikipedia pages (use the --include-interlinks option). This will output one additional JSON dict for each article: { "interlinks": { "article title 1": "interlink text 1", "article title 2": "interlink text 2", ... } }
    • Example: extract the Wikipedia graph with article links as edges, from a raw Wikipedia dump:

      python -m gensim.scripts.segment_wiki --include-interlinks --file ~/Downloads/enwiki-latest-pages-articles.xml.bz2 --output ~/Desktop/enwiki-latest.jsonl.gz
      
      • Read this field from the segment_wiki output:
      import json
      from smart_open import smart_open
      
      with smart_open("enwiki-latest.jsonl.gz") as infile:
          for doc in infile:
              doc = json.loads(doc)
      
              src_node = doc['title']
              dst_nodes = doc['interlinks'].keys()
      
              print(u"Source node: {}".format(src_node))
              print(u"Destination nodes: {}".format(u", ".join(dst_nodes)))
              break
      
      """
      OUTPUT:
      
      Source node: Anarchism
      Destination nodes: anarcha-feminist, Ivan Illich, Adolf Brand, Josiah Warren, will (philosophy), anarcha-feminism, Anarchism in Mexico, Lysander Spooner, English Civil War, G8, Sebastien Faure, Nihilist movement, Sbastien Faure, Left-wing politics, imamate, Pierre Joseph Proudhon, anarchist communism, Universit popolare (Italian newspaper), 1848 Revolution, Synthesis anarchism, labour movement, anarchist communists, collectivist anarchism, polyamory, post-humanism, postcolonialism, anti war movement, State (polity), security culture, Catalan people, Stoicism, Progressive education, stateless society, Umberto I of Italy, German language, Anarchist schools of thought, NEFAC, Jacques Ellul, Spanish Communist Party, Crypto-anarchism, ruling class, non-violence, Platformist, The History of Sexuality, Revolutions of 191723, Federacin Anarquista Ibrica, propaganda of the deed, William B. Greene, Platformism, mutually exclusive, Fraye Arbeter Shtime, Adolf Hitler, oxymoron, Paris Commune, Anarchism in Italy#Postwar years and today, Oranienburg, abstentionism, Free Society, Henry David Thoreau, privative alpha, George I of Greece, communards, Gustav Landauer, Lucifer the Lightbearer, Moses Harman, coercion, regicide, rationalist, Resistance during World War II, Christ (title), Bohemianism, individualism, Crass, black bloc, Spanish Revolution of 1936, Erich Mhsam, Empress Elisabeth of Austria, Free association (communism and anarchism), general strike, Francesc Ferrer i Gurdia, Catalan anarchist pedagogue and free-thinker, veganarchism, Traditional knowledge, Japanese Anarchist Federation, Diogenes of Sinope, Hierarchy, sexual revolution, Naturism, Bavarian Soviet Republic, February Revolution, Eugene Varlin, Renaissance humanism, Mexican Liberal Party, Friedrich Engels, Fernando Tarrida del Mrmol, Caliphate, Marxism, Jesus, John Cage, Umanita Nova, Anarcho-pacifism, Peter Kropotkin, Religious anarchism, Anselme Bellegarrigue, civilisation, moral obligation, hedonist, 
Free Territory (Ukraine), -ism, neo-liberalism, Austrian School, philosophy, freethought, Joseph Goebbels, Conservatism, anarchist economics, Cavalier, Maximilien de Robespierre, Comstockery, Dorothy Day, Anarchism in France, Fdration anarchiste, World Economic Forum, Amparo Poch y Gascn, Sex Pistols, women's rights, collectivisation, Taoism, common ownership, William Batchelder Greene, Collective farming, popular education, biphobia, targeted killings, Protestant Christianity, state socialism, Marie Franois Sadi Carnot, Stephen Pearl Andrews, World Trade Organization, Communist Party of Spain (main), Pluto Press, Levante, Spain, Alexander Berkman, Wilhelm Weitling, Kharijites, Bolshevik, Liberty (18811908), Anarchist Aragon, social democrats, Dielo Truda, Post-left anarchy, Age of Enlightenment, Blanquism, Walden, mutual aid (organization), Far-left politics, privative, revolutions of 1848, anarchism and nationalism, punk rock, tienne de La Botie, Max Stirner, Jacobin (politics), agriculture, anarchy, Confederacion General del Trabajo de Espaa, toleration, reformism, International Anarchist Congress of Amsterdam, The Ego and Its Own, Ukraine, Civil Disobedience (Thoreau), Spanish Civil War, David Graeber, Anarchism and issues related to love and sex, James Guillaume, Insurrectionary anarchism, Political repression, International Workers' Association, Barcelona, Bulgaria, Voline, Zeno of Citium, anarcho-communists, organized religion, libertarianism, bisexuality, Ricardo Flores Magn, Henri Zisly, Eight-hour day, Freetown Christiania, heteronormativity, Mikhail Bakunin, Propagandaministerium, Ezra Heywood, individual reappropriation, Modern School (United States), archon, Confdration nationale du travail, socialist movement, History of Islam, Max Nettlau, Political Justice, Reichstag fire, Anti-Christianity, decentralised, Issues in anarchism#Communism, deschooling, Christian movement, squatter, Anarchism in Germany, Catalonia, Louise Michel, Solidarity Federation, 
What is Property?, European individualist anarchism, Pierre-Joseph Proudhon, Mexican Revolution, wikt:anarchism, Blackshirts, Jewish anarchism, Russian Civil War, property rights, anti-authoritarian, individual reclamation, propaganda by the deed, from each according to his ability, to each according to his need, Feminist movement, Confiscation, social anarchism, Anarchism in Russia, Daniel Gurin, Uruguayan Anarchist Federation, Anarcha-feminism, Enrags, Cynicism (philosophy), workers' council, The Word (free love), Allen Ginsberg, Campaign for Nuclear Disarmament, antimilitarism, Workers' self-management, Federacin Obrera Regional Argentina, self-governance, free market, Carlos I of Portugal, Simon Critchley, Anti-clericalism, heterosexual, Layla AbdelRahim, Mexican Anarchist Federation, Anarchism and Marxism, October Revolution, Anti-nuclear movement, Joseph Djacque, Bolsheviks, Luigi Fabbri, morality, Communist party, Sam Dolgoff, united front, Ammon Hennacy, social ecology, commune (intentional community), Oscar Wilde, French Revolution, egoist anarchism, Comintern, transphobia, anarchism without adjectives, social control, means of production, Michel Onfray, Anarchism in France#The Fourth Republic (19451958), syndicalism, Anarchism in Spain, Iberian Anarchist Federation, International of Anarchist Federations, Emma Goldman, Netherlands, anarchist free school, International Workingmen's Association, Queer anarchism, Cantonal Revolution, trade unionism, Karl Marx, LGBT community, humanism, Anti-fascism, Carrara, political philosophy, Anarcho-transhumanism, libertarian socialist, Russian Revolution (1917), Two Cheers for Anarchism: Six Easy Pieces on Autonomy, Dignity, and Meaningful Work and Play, Emile Armand, insurrectionary anarchism, individual, Zhuang Zhou, Free Territory, White movement, Greenwich Village, Virginia Bolten, transcendentalist, public choice theory, wikt:brigand, Issues in anarchism#Participation in statist democracy, free love, Mutualism 
(economic theory), Anarchist St. Imier International, censorship, federalist, 6 February 1934 crisis, biennio rosso, anti-clerical, centralism, Anarchism: A Documentary History of Libertarian Ideas, minarchism, James C. Scott, First International, homosexuality, political theology, spontaneous order, Oranienburg concentration camp, anarcho-communism, negative liberty, post-modernism, Anarchism in Italy, Leopold Kohr, union of egoists, counterculture, Miguel Gimenez Igualada, philosophical anarchism, International Libertarian Solidarity, homosexual, Counterculture of the 1960s, Errico Malatesta, strikebreaker, Workers' Party of Marxist Unification, Clifford Harper, Reification (fallacy), patriarchy, anarchist law, Apostle (Christian), market (economics), Summerhill School, positive liberty, socialism, feminism, Direct action, Melchor Rodríguez García, William Godwin, Nazi concentration camps, Synthesist anarchism, Margaret Anderson, Han Ryner, Federation of Organized Trades and Labor Unions, technology, Workers Solidarity Movement, Edmund Burke, Encyclopædia Britannica, state (polity), Herbert Read, Park Güell, utilitarian, far right leagues, Limited government, self-ownership, Pejorative, homophobia, Industrial Workers of the World, The Dispossessed, Hague Congress (1872), Stalinism, Reciprocity (cultural anthropology), Fernand Pelloutier, individualist anarchism in France, The False Principle of our Education, individualist anarchism, Pierre Monatte, Soviet Union, counter-economics, Rudolf Rocker, Anarchism and capitalism, Parma, Black Rose Books, lesbian, Arditi del Popolo, Emile Armand (1872–1962), who propounded the virtues of free love in the Parisian anarchist milieu of the early 20th century, collectivism, Development criticism, John Henry Mackay, Benoît Broutchoux, Illegalism, Laozi, feminist, Christiaan Cornelissen, Syndicalist Workers' Federation, anarcho-syndicalism, Andalusia, Renzo Novatore, trade union, autonomist marxism, dictatorship of the proletariat, 
Mujeres Libres, Voltairine de Cleyre, Post-anarchism, participatory economics, Confederación Nacional del Trabajo, Syncretic politics, direct democracy, Jean-Jacques Rousseau, Green anarchism, Surrealism, labour unions, A. S. Neill, christian anarchist, Bonnot Gang, Anti-capitalism, Anarchism in Brazil, simple living, enlightened self-interest, Confédération générale du travail, class conflict, International Workers' Day, Hébertists, Gerrard Winstanley, Francoism, anarcho-pacifist, Andrej Grubacic, individualist anarchist and social anarchist thinkers., April Carter, private property, penal colonies, Libertarian socialism, Camillo Berneri, Christian anarchism, transhumanism, Lucifer, the Light-Bearer, Edna St. Vincent Millay, unschooling, Leo Tolstoy, M. E. Lazarus, Spanish Anarchists, Buddhist anarchism, ideology, William McKinley, anarcho-primitivism, Francesc Pi i Margall, :Category:Anarchism by country, International Workers Association, Anarcho-capitalism, Lois Waisbrooker, wikt:Solidarity, Baja California, social revolution, Unione Sindacale Italiana, Lev Chernyi, Alex Comfort, Sonnenburg, Leon Czolgosz, Volin, utopian, Argentine Libertarian Federation, Nudism, Left-wing market anarchism, insurrection, definitional concerns in anarchist theory, infinitive, affinity group, World Trade Organization Ministerial Conference of 1999 protest activity, class struggle, nonviolence, John Zerzan, poststructuralist, Noam Chomsky, Second Fitna, Julian Beck, Philadelphes, League of Peace and Freedom, Fédération Anarchiste, Kronstadt rebellion, Cold War, André Breton, Silvio Gesell, libertarian anarchism, voluntary association, anti-globalisation movement, birth control, 
L. Susan Brown, anarcho-naturism, personal property, Roundhead, Harold Barclay, The Joy of Sex, Council communism, Lucía Sánchez Saornil, tyrannicide, Neopaganism, lois scélérates, Johann Most, Anarchist Catalonia, Albert Camus, Protests of 1968, Alexander II of Russia, Spain's economy, Federazione Anarchica Italiana, Cuba, German Revolution of 1918–1919, stirner, Property is theft, Situationist International, law and economics
      
      
  • Add support for SMART notation for TfidfModel. Fix #1785 (@markroxor, #1791)

    • Natural extension of TfidfModel to allow different weighting and normalization schemes

      from gensim.corpora import Dictionary
      from gensim.models import TfidfModel
      import gensim.downloader as api
      
      data = api.load("text8")
      dct = Dictionary(data)
      corpus = [dct.doc2bow(line) for line in data]
      
      # Train Tfidf model using the SMART notation, smartirs="ntc" where
      # 'n' - natural term frequency
      # 't' - inverse document frequency (idf)
      # 'c' - cosine normalization
      #
      # More information about possible values available in documentation or https://nlp.stanford.edu/IR-book/html/htmledition/document-and-query-weighting-schemes-1.html
      
      model = TfidfModel(corpus, id2word=dct, smartirs="ntc")
      vectorized_corpus = list(model[corpus])
      
      
    • SMART Information Retrieval System (wiki)

  • Add CircleCI for building Gensim documentation. Fix #1807 (@menshikh-iv, #1822)

    • An easy way to preview the rendered documentation (especially if you don't use Linux)
      • Go to the Details link of CircleCI in your PR, click on the Artifacts tab, and choose the HTML file that you want to view; a new tab will open with the rendered HTML page
    • Integration with Github, to see the documentation directly from the pull request page
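The SMART "ntc" scheme used in the TfidfModel example above can be illustrated without gensim. The following is a minimal pure-Python sketch, not gensim's implementation: the function name `tfidf_ntc` and the toy documents are hypothetical, and the base-2 logarithm for idf is an assumption made for illustration.

```python
import math
from collections import Counter


def tfidf_ntc(docs):
    """Weight tokenized documents with natural tf, idf, and cosine normalization.

    Hypothetical helper for illustration; not part of gensim.
    """
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)  # 'n': natural (raw) term counts
        # 't': scale by idf, taken here as log2(N / df) (the base is an assumption)
        vec = {term: count * math.log2(n_docs / df[term]) for term, count in tf.items()}
        # 'c': divide by the Euclidean length so the vector has unit norm
        norm = math.sqrt(sum(w * w for w in vec.values()))
        weighted.append({t: w / norm for t, w in vec.items()} if norm else vec)
    return weighted


# Toy corpus, invented for this sketch.
docs = [["cat", "sat", "mat"], ["cat", "cat", "dog"], ["dog", "barked"]]
vectors = tfidf_ntc(docs)
```

Terms that occur in fewer documents receive a larger weight, and after the cosine step every document vector has length 1, so dot products between documents are cosine similarities.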

:red_circle: Bug fixes:

:books: Tutorial and doc improvements:

:+1: Improvements:

:warning: Deprecations (will be removed in the next major release)

  • Remove

    • gensim.models.wrappers.fasttext (obsoleted by the new native gensim.models.fasttext implementation)
    • gensim.examples
    • gensim.nosy
    • gensim.scripts.word2vec_standalone
    • gensim.scripts.make_wiki_lemma
    • gensim.scripts.make_wiki_online
    • gensim.scripts.make_wiki_online_lemma
    • gensim.scripts.make_wiki_online_nodebug
    • gensim.scripts.make_wiki (all of these obsoleted by the new native gensim.scripts.segment_wiki implementation)
    • deprecated functions and attributes
  • Move

    • gensim.scripts.make_wikicorpus → gensim.scripts.make_wiki.py
    • gensim.summarization → gensim.models.summarization
    • gensim.topic_coherence → gensim.models._coherence
    • gensim.utils → gensim.utils.utils (old imports will continue to work)
    • gensim.parsing.* → gensim.utils.text_utils

3.2.0, 2017-12-09

:star2: New features:

  • New download API for corpora and pre-trained models (@chaitaliSaini & @menshikh-iv, #1705 & #1632 & #1492)

    • Download large NLP datasets in one line of Python, then use with memory-efficient data streaming:

      import gensim.downloader as api
      
      for article in api.load("wiki-english-20171001"):
          print(article)
      
      
    • Don't waste time searching for good word embeddings; use the curated ones:

      import gensim.downloader as api
      
      model = api.load("glove-twitter-25")
      model.most_similar("engineer")
      
      # [('specialist', 0.957542896270752),
      #  ('developer', 0.9548177123069763),
      #  ('administrator', 0.9432312846183777),
      #  ('consultant', 0.93915855884552),
      #  ('technician', 0.9368376135826111),
      #  ('analyst', 0.9342101216316223),
      #  ('architect', 0.9257484674453735),
      #  ('engineering', 0.9159940481185913),
      #  ('systems', 0.9123805165290833),
      #  ('consulting', 0.9112802147865295)]
      
    • Blog post introducing the API and design decisions.

    • Jupyter notebook with examples

  • New model: Poincaré embeddings (@jayantj, #1696 & #1700 & #1757 & #1734)

    • Embed a graph (taxonomy) in the same way as word2vec embeds words:

      from gensim.models.poincare import PoincareRelations, PoincareModel
      from gensim.test.utils import datapath
      
      data = PoincareRelations(datapath('poincare_hypernyms.tsv'))
      model = PoincareModel(data)
      model.kv.most_similar("cat.n.01")
      
      # [('kangaroo.n.01', 0.010581353439700418),
      # ('gib.n.02', 0.011171531439892076),
      # ('striped_skunk.n.01', 0.012025106076442395),
      # ('metatherian.n.01', 0.01246679759214648),
      # ('mammal.n.01', 0.013281303506525968),
      # ('marsupial.n.01', 0.013941330203709653)]
      
    • Tutorial on Poincaré embeddings (Jupyter notebook).

    • Model introduction and the journey of its implementation (blog post).

    • Original paper on arXiv.
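The scores in the Poincaré example above are listed in ascending order, which suggests they are distances rather than similarities. As a hedged illustration, the standard distance on the Poincaré ball can be written in a few lines of pure Python. The function name `poincare_distance` and the sample points are hypothetical, and this sketch is not gensim's code.

```python
import math


def poincare_distance(u, v):
    """Distance between two points strictly inside the unit ball.

    Uses the standard Poincare ball formula:
    arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    """
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_dist / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v)))


# Sample points, invented for this sketch.
origin = (0.0, 0.0)
near_centre = (0.1, 0.0)
near_edge = (0.9, 0.0)

d_centre = poincare_distance(origin, near_centre)
d_edge = poincare_distance(origin, near_edge)
```

Distances grow rapidly towards the boundary of the ball, which leaves room for the many leaves of a taxonomy near the edge while general concepts sit near the centre.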

  • Optimized FastText (@manneshiva, #1742)

    • New fast multithreaded implementation of FastText, natively in Python/Cython. Deprecates the existing wrapper for Facebook's C++ implementation.

      import gensim.downloader as api
      from gensim.models import FastText
      
      model = FastText(api.load("text8"))
      model.most_similar("cat")
      
      # [('catnip', 0.8538144826889038),
      #  ('catwalk', 0.8136177062988281),
      #  ('catchy', 0.7828493118286133),
      #  ('caf', 0.7826495170593262),
      #  ('bobcat', 0.7745151519775391),
      #  ('tomcat', 0.7732658386230469),
      #  ('moat', 0.7728310823440552),
      #  ('caye', 0.7666271328926086),
      #  ('catv', 0.7651021480560303),
      #  ('caveat', 0.7643581628799438)]
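Most neighbours of "cat" in the FastText output above share character sequences with it, which is what one would expect from a model that builds word vectors out of character n-grams. The sketch below only extracts and compares n-gram sets. It is a simplification, not gensim's implementation (the real model hashes n-grams into buckets and learns a vector for each), and the helper names `char_ngrams` and `ngram_overlap` are hypothetical. The 3 to 6 character range matches the usual FastText defaults.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of '<word>' for n in [min_n, max_n]."""
    wrapped = f"<{word}>"  # boundary markers distinguish prefixes and suffixes
    return {
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    }


def ngram_overlap(a, b):
    """Jaccard overlap between the n-gram sets of two words."""
    grams_a, grams_b = char_ngrams(a), char_ngrams(b)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


cat_grams = char_ngrams("cat")
overlap_catnip = ngram_overlap("cat", "catnip")
overlap_dog = ngram_overlap("cat", "dog")
```

Words such as "catnip" share several n-grams with "cat", while "dog" shares none, so subword sharing alone already pulls the first pair together.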

  • Binary pre-compiled wheels for Windows, OSX and Linux (@menshikh-iv, MacPython/gensim-wheels/#7)

    • Users no longer need a C compiler to use the fast (Cythonized) versions of word2vec, doc2vec, fasttext, etc.
    • Faster Gensim pip installation
  • Added DeprecationWarnings to deprecated methods and parameters, with a clear schedule for removal.
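One common way to attach a DeprecationWarning to an old function is a small decorator built on the standard `warnings` module. The sketch below shows the general pattern only. The names `deprecated`, `old_scale`, and `new_scale` are hypothetical, and this is not necessarily how gensim implements its warnings.

```python
import functools
import warnings


def deprecated(reason):
    """Decorator factory: warn with DeprecationWarning on every call."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            warnings.warn(
                f"Call to deprecated function {func.__name__}: {reason}",
                category=DeprecationWarning,
                stacklevel=2,  # attribute the warning to the caller's line
            )
            return func(*args, **kwargs)
        return wrapper
    return decorator


# Hypothetical function, kept only for backward compatibility.
@deprecated("will be removed in the next major release; use new_scale instead")
def old_scale(x):
    return 2 * x
```

Python hides DeprecationWarning in many contexts by default, so running with `python -W default` or calling `warnings.simplefilter("default")` makes the messages visible during testing.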

:+1: Improvements:

  • Add Montemurro and Zanette's entropy based keyword extraction algorithm. Fix #665 (@PeteBleackley, #1738)
  • Fix flake8 E731, E402, refactor tests & sklearn API code. Partial fix #1644 (@horpto, #1689)
  • Reduce distribution size. Fix #1698 (@menshikh-iv, #1699)
  • Improve scan_vocab speed, build_vocab_from_freq method (@jodevak, #1695)
  • Improve segment_wiki script (@piskvorky, #1707)
  • Add custom dtype support for LdaModel. Partially fix #1576 (@xelez, #1656)
  • Add doc2idx method for gensim.corpora.Dictionary. Fix #1634 (@roopalgarg, #1720)
  • Add tox and pytest to gensim, integration with Travis and Appveyor. Fix #1613, #1644 (@menshikh-iv, #1721)
  • Add flag for hiding outdated data for gensim.downloader.info (@menshikh-iv, #1736)
  • Add reproducible order between Python versions for gensim.corpora.Dictionary (@formi23, #1715)
  • Update tox.ini, setup.cfg, README.md (@menshikh-iv, #1741)
  • Add optimized logsumexp for LdaModel (@arlenk, #1745)
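For readers unfamiliar with logsumexp, the function computes log(sum(exp(x))) and is normally implemented by subtracting the maximum first so that the exponentials cannot overflow. The following pure-Python sketch shows that numerical trick only. It is not the optimized routine added to gensim.

```python
import math


def logsumexp(values):
    """Compute log(sum(exp(v) for v in values)) without overflow."""
    peak = max(values)
    if math.isinf(peak):
        # All values are -inf, or at least one is +inf: the result equals the peak.
        return peak
    # After shifting by the peak, every exponent is <= 0, so exp() stays in [0, 1].
    return peak + math.log(sum(math.exp(v - peak) for v in values))


# A direct math.log(math.exp(1000.0) * 2) would overflow; the shifted form does not.
large = logsumexp([1000.0, 1000.0])
small = logsumexp([-1000.0, -1000.0])
```

The shifted form is algebraically identical to the naive one, since log(sum(exp(x))) = m + log(sum(exp(x - m))) for any constant m.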

:red_circle: Bug fixes:

:books: Tutorial and doc improvements:

:warning: Deprecations (will be removed in the next major release)

  • Remove

    • gensim.examples
    • gensim.nosy
    • gensim.scripts.word2vec_standalone
    • gensim.scripts.make_wiki_lemma
    • gensim.scripts.make_wiki_online
    • gensim.scripts.make_wiki_online_lemma
    • gensim.scripts.make_wiki_online_nodebug
    • gensim.scripts.make_wiki
  • Move

    • gensim.scripts.make_wikicorpus → gensim.scripts.make_wiki.py
    • gensim.summarization → gensim.models.summarization
    • gensim.topic_coherence → gensim.models._coherence
    • gensim.utils → gensim.utils.utils (old imports will continue to work)
    • gensim.parsing.* → gensim.utils.text_utils