January 03, 2014

Extracting n-word phrases in large texts

This is a summary of resources posted on [Corpora-List] in early 2014.

CMU-Cambridge Statistical Language Modeling toolkit

Sketch Engine

Laurence Anthony's AntConc



Software for extracting n-grams as well as patterns that are not consecutive (skipgrams). It is written in C++ for speed and memory efficiency, and comes with a Python binding for use from Python scripts. It also has a standalone command-line tool that can extract n-word phrases directly.
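To make the difference between n-grams and skipgrams concrete, here is a minimal sketch in plain Python. It does not use the API of the software described above or of any tool listed here; the function names (ngrams, skipgrams, count_phrases) and the whitespace tokenization are assumptions made for illustration only. A pure-Python counter like this is fine for small inputs but will likely need far more memory than a dedicated C++ tool on a large corpus.

```python
from collections import Counter
from itertools import combinations


def ngrams(tokens, n):
    """Yield every run of n consecutive tokens."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])


def skipgrams(tokens, n, max_skip):
    """Yield n-token patterns whose tokens may be separated by gaps.

    The first token is fixed at each position; the remaining n - 1 tokens
    are picked from the following window, so at most max_skip tokens are
    skipped in total. Contiguous n-grams (zero skips) are included.
    """
    window = n + max_skip
    for i in range(len(tokens)):
        following = range(i + 1, min(i + window, len(tokens)))
        for rest in combinations(following, n - 1):
            yield (tokens[i],) + tuple(tokens[j] for j in rest)


def count_phrases(text, n):
    """Count n-word phrases, assuming naive lowercase whitespace tokenization."""
    tokens = text.lower().split()
    return Counter(ngrams(tokens, n))


if __name__ == "__main__":
    counts = count_phrases("the cat sat on the cat", 2)
    for phrase, freq in counts.most_common(3):
        print(" ".join(phrase), freq)
```

For real corpora you would probably swap the whitespace split for a proper tokenizer and filter the counts by a minimum frequency before keeping them.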

Maarten van Gompel
