Refinery

Refinery - A locally deployable open-source web platform for analysis of large document collections

Welcome to the documentation for Refinery.

Refinery was a project supported by the Knight Foundation's Prototype Fund to build an NLP web application that wraps complicated NLP tools in an easy-to-use web interface. For the analysis of large document corpora, Refinery provides a simple drag-and-drop interface along with interactive visualizations that give intuitive insights into your data.

How is it built?

Refinery is deployed locally using Vagrant (tested on v1.8.1) and VirtualBox (tested on v5.0). The application scales to large document corpora thanks to BNPy, a Bayesian nonparametric toolbox built around recent advances in scalable inference for these types of models. Refinery can be considered a general tool for quickly discovering a set of topics, which can then be used to isolate relevant documents and extract insights from them.

Installing and Running Refinery

Refinery is a browser-driven web application built primarily on Python. It was developed with the requirement that installation be as simple as possible. Refinery requires three main packages: Git, VirtualBox, and Vagrant. VirtualBox and Vagrant allow Refinery to run inside a virtual machine that is accessible through your browser. Vagrant also deploys a Puppet manifest, which automates the installation of the many software modules Refinery needs.

Installation requires approximately 2.3 GB of hard drive space and takes roughly 20-30 minutes on a high-speed Internet connection, so please keep that in mind before installing.

Git is needed to clone the repository that contains the main source code. If you don't wish to use Git, you can download the zip file and uncompress it to a folder of your choice. Either way, you'll still need the other two pieces of software, VirtualBox and Vagrant.
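The requirements above can be checked before starting the install. The following is a minimal sketch, not part of Refinery itself. It assumes the three tools are exposed on your PATH under their usual executable names (git, vagrant, and VBoxManage), and it treats the 2.3 GB figure as a rough threshold for the drive holding the current directory.

```python
import shutil

# Assumption: these are the usual executable names for Git, Vagrant,
# and the VirtualBox command-line tool; adjust them for your system.
REQUIRED_TOOLS = ("git", "vagrant", "VBoxManage")
REQUIRED_BYTES = int(2.3 * 1024**3)  # roughly the 2.3 GB quoted above


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def has_enough_space(path=".", required=REQUIRED_BYTES):
    """Return True if the drive holding `path` has `required` bytes free."""
    return shutil.disk_usage(path).free >= required


if __name__ == "__main__":
    missing = missing_tools()
    enough = has_enough_space()
    if missing:
        print("Missing from PATH:", ", ".join(missing))
    if not enough:
        print("Less than about 2.3 GB free on this drive")
    if not missing and enough:
        print("Prerequisites look fine")
```

Run the script from the folder where you plan to clone the repository so that the disk-space check looks at the right drive.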

To modify the installation process, edit the configuration file Vagrantfile in the root directory of the repository, which contains the settings that guide it. To install Refinery, run the following from the command line:

git clone https://github.com/daeilkim/refinery.git
cd refinery
vagrant up

The vagrant up command boots the virtual machine and starts the web server. Once it finishes, open any browser and go to http://11.11.11.11:8080. You should see a login screen, where you can log in with:

username: doc
password: refinery
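Because the first boot can take a while, it may help to poll the address above until the web server answers. This is a minimal sketch using only the Python standard library. The function names, the retry count, and the delay are illustrative choices, not part of Refinery.

```python
import time
import urllib.error
import urllib.request

REFINERY_URL = "http://11.11.11.11:8080"  # address given in this guide


def is_up(url=REFINERY_URL, timeout=5.0):
    """Return True if a web server answers an HTTP request at `url`."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # An HTTP error status still means a server is answering.
        return True
    except OSError:
        # Connection refused, timeout, unreachable host, and similar.
        return False


def wait_until_up(url=REFINERY_URL, attempts=30, delay=10.0, probe=is_up):
    """Poll `url` until `probe` succeeds or `attempts` run out."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            time.sleep(delay)  # wait before the next try
    return False
```

Calling wait_until_up() with the defaults checks every 10 seconds for about five minutes and returns True as soon as the login page is reachable.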

To see how Refinery works, you can watch this video, which shows a basic run-through using one of the datasets included in the repository.

Further information

For machine learning, Refinery uses the BNPy package (BNPy Git Repository). The backend is supported by a PostgreSQL database, which stores the document corpora, and by Redis, which provides the pub/sub messaging used to show real-time updates. The web application is built on Python Flask, Gunicorn, Celery, and Nginx.

Logging into Vagrant

Once the repository is cloned and the command vagrant up has been executed, you'll be able to log in to the virtual machine by typing:

vagrant ssh

This should bring you into the default home directory, /home/vagrant. From there, you'll have a virtual machine with a directory structure similar to a default Ubuntu installation. If you're interested in modifying anything associated with Refinery, you'll find the application in the /vagrant/refinery directory.

Topic Modeling Code

The code that executes the topic models can be found in /vagrant/refinery/webapp/topicmodel.py. If you have worked with BNPy before, you can find the execution command on line 266:

hmodel = bnpy.Run.run(data, 'HDPModel', 'Mult', 'VB', doSaveToDisk=False, K=exinfo.nTopics,
                      nLap=100, initname="randomfromprior",
                      customFuncPath="refinery/webapp/", customFuncArgs=json.dumps(a))

For more information on how to configure BNPy, please refer to the developer's documentation found on their Git repository (BNPy Git Repository).
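If you want to experiment with the settings in the call above, one option is to collect the keyword arguments in a small helper first. This is only a sketch. The helper name build_run_kwargs and the contents of callback_args are hypothetical, and callback_args stands in for the variable a, whose real contents are not described in this guide.

```python
import json


def build_run_kwargs(n_topics, callback_args, n_laps=100):
    """Assemble the keyword arguments used in the bnpy.Run.run call above.

    `callback_args` is a hypothetical stand-in for the variable `a` in the
    original call; its real contents are not documented here.
    """
    return {
        "doSaveToDisk": False,
        "K": n_topics,
        "nLap": n_laps,
        "initname": "randomfromprior",
        "customFuncPath": "refinery/webapp/",
        # The original call passes these arguments as a JSON string.
        "customFuncArgs": json.dumps(callback_args),
    }


# Example values below are made up for illustration.
kwargs = build_run_kwargs(n_topics=20, callback_args={"analysis_id": 1})
print(kwargs["K"], json.loads(kwargs["customFuncArgs"]))

# Inside the VM, where bnpy is installed, the call would then read:
# hmodel = bnpy.Run.run(data, 'HDPModel', 'Mult', 'VB', **kwargs)
```

Keeping the settings in one dictionary makes it easier to change the number of topics or laps without editing the call itself.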

Troubleshooting Refinery

Most installation issues are the result of outdated Vagrant links to Ubuntu images. The software was last confirmed to be functional on 3/5/2016. Please contact @daeilkim if you run into any installation issues.

Authors and License

Refinery is an open source project under the MIT License. Copyright (C) Daeil Kim (@daeilkim)

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Packaged Software

Refinery uses several NLP packages that are also open source. It primarily uses BNPy (https://bitbucket.org/michaelchughes/bnpy-dev/), licensed under BSD-3, and also uses the Splitta algorithm for sentence boundary detection (https://code.google.com/archive/p/splitta/), licensed under Apache 2.0.