Saturday, August 15, 2015

Osler's Library (oslersLibrary): a free, web-hostable, full-text medical search tool.

We are going to do some information retrieval in the medical domain. This will be a long and ongoing project with many experiments. I am no expert in information retrieval (though as a physician I know what right answers look like in the medical domain), so we will learn together and produce something useful to the clinician: a complete product.

There exists a very large trove of medical PDFs in circulation, both legal and not (medical book PDFs are available as torrents; not that I am endorsing this, but they exist and many use them). However you personally acquire your own medical library, it could likely benefit from full-text search. That is the thrust of the first part of this project. Subsequent parts will expand our product by turning cutting-edge informatics research articles into new features (for example, random walks over the UMLS for automatic query expansion).

Why create our own medical library? To write our own algorithms, of course (though speed, selection, and accessibility are big too, as are control and privacy).

So this first article establishes our baseline: a shell, though fully functional and useful; essentially an HTML wrapper around Elasticsearch.

Bird's-eye view of the tech:
Full-text search on the backend (Elasticsearch)
HTML5/jQuery/Django frontend with authentication
Python as the language of choice
elasticsearch-py (the elasticsearch pip package) for our Elasticsearch bindings
Apache2/WSGI recommended for deployment (not covered; beyond scope)

Choice 1: full-text search
When it comes to full-text search, my major decision point was Solr vs. Elasticsearch. Honestly, I had only used Solr before, but after sniffing around Elasticsearch for a while I was hooked. I'm not saying my search was exhaustive, but what appealed to me most was the easy setup, lightning speed, vibrant community, and JSON-driven RESTful API.
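
To give a feel for that JSON-driven API, here's a quick sketch using the Python client we'll install below. The index and field names ("library", "text", "file") are illustrative placeholders of my own, not the app's actual schema:

from elasticsearch import Elasticsearch

es = Elasticsearch()  # connects to localhost:9200 by default

# Full-text match query; Elasticsearch scores and ranks the hits for us.
results = es.search(index="library", body={
    "query": {"match": {"text": "beta blocker heart failure"}}
})
for hit in results["hits"]["hits"]:
    print hit["_score"], hit["_source"]["file"]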

Choice 2: Python, Django, Bootstrap
I'm comfortable in Java, but Python is so fast to write and Django is so production-ready and rapid that this was an easy choice. Web apps, to me, seem to be the new GUI platform. Unless you need full-on App Store-style apps (in which case you're going to have to work through Objective-C, which is doable, but you're likely to still need a web interface), a web GUI has the double edge of giving you a web presence as well as iPhone/iPad friendliness (when you use a responsive grid system like Bootstrap).

So, in summary, this is effectively a semi-production-capable (home production, anyway), pretty web GUI with authentication, wrapped around Elasticsearch, a world-class search engine employed across industry. It gives us a great starting point for implementing experiments by adding features to a clinic/hospital-accessible medical library.

One quick note: the design of this was quick and dirty. I patched up some holes for this at-home production, but things like unit testing, a cornerstone of good development, went out the window. If anyone wants to write tests for me, please, by all means... but I'm on call every other day and I've got to do what I can when I can.

Install instructions (you need some skills, doctor):

Check out and install Elasticsearch. Grab the elasticsearch-1.6.0 tarball from elastic.co first; then the steps are the same on Mac and Ubuntu 14:

cd /usr/local
sudo tar xvzf elasticsearch-1.6.0.tar.gz
cd elasticsearch-1.6.0
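
Start it and check that it's listening (-d runs it as a daemon; the curl should come back with a small JSON blob reporting version 1.6.0):

sudo bin/elasticsearch -d
curl http://localhost:9200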

On Ubuntu, there is a good script here and here that sets it up as a service.

Use virtualenv. If you're unfamiliar, see here. Also, we use Django; read about it here and here.
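
If you haven't used virtualenv before, a minimal setup looks like:

sudo pip install virtualenv
virtualenv venv
source venv/bin/activate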

pip install the following packages (inside the activated virtualenv, no sudo needed; a one-line command follows the list):
Django==1.8.2
PyPDF2==1.24
elasticsearch==1.5.0
urllib3==1.10.4
uuid==1.30
wsgiref==0.1.2
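
As a single command, that's:

pip install Django==1.8.2 PyPDF2==1.24 elasticsearch==1.5.0 urllib3==1.10.4 uuid==1.30 wsgiref==0.1.2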

Download the project from my GitHub:
git clone https://github.com/jtgreen/oslersLibrary.git

Rename settings.py_scrubbed to settings.py under oslersLibrary.

Generate a new secret key and paste it into the single quotes under SECRET_KEY (run this in a Python shell):
import random
''.join([random.SystemRandom().choice('abcdefghijklmnopqrstuvwxyz0123456789!@#$%^&*(-_=+)') for i in range(50)])

I deleted the local SQLite db, which is just used for accounts and sessions, so recreate it. Inside the virtualenv and within the directory "oslersLibrary":

python manage.py migrate

you should see: 

Operations to perform:
  Apply all migrations: admin, contenttypes, auth, sessions
Running migrations:
  Applying contenttypes.0001_initial... OK
  Applying auth.0001_initial... OK
  Applying admin.0001_initial... OK
  Applying sessions.0001_initial... OK

Now let's import some data. The data is stored in the data dir under oslersLibrary's static dir, so mkdir static/data and copy all your PDFs into it. Every PDF that the script we're about to run finds will be split into pages, OCR'ed, indexed into Elasticsearch, and then have its intermediate files cleaned up. This is CPU intensive!

Each PDF gets its own directory, so static/data becomes static/data/my_file, with the components the app needs inside the dir my_file.
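
For a feel of what that import step amounts to, here is a minimal sketch, not the repo's actual script: I assume an index called "library" and a simple per-page document of my own design, and I only pull embedded text with PyPDF2 (the real pipeline also OCRs scanned pages, which I skip for brevity):

import os
from PyPDF2 import PdfFileReader
from elasticsearch import Elasticsearch

es = Elasticsearch()  # localhost:9200 by default

def index_pdf(path, index_name="library"):
    # Split the PDF into pages and index each page's text separately,
    # so a search hit can point at a specific page of a specific book.
    with open(path, "rb") as f:
        reader = PdfFileReader(f)
        for page_num in range(reader.getNumPages()):
            text = reader.getPage(page_num).extractText()
            es.index(index=index_name, doc_type="page", body={
                "file": os.path.basename(path),
                "page": page_num + 1,
                "text": text,
            })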

Eventually I'll add a drag-and-drop interface to the web GUI.

Eventually I'll have time for some good programming practices like unit tests :-/

One last thing, create an admin to log in with:

python manage.py createsuperuser

You can now log in to the library this way. You can create new users at the /admin endpoint.

That's it! 

Run the django test server like so:

python manage.py runserver

or if you want to specify a port or access from another computer (not localhost):

python manage.py runserver 0.0.0.0:8000

Deployment is beyond the scope of this blog post. Good articles can be found easily with Google. I recommend Apache2 with WSGI.

Questions or comments for improvement are welcome!


Again, remember... this is a laboratory of sorts: a launchpad for future experiments AND a useful tool to run from your home server (I recommend DynDNS) to access your personal medical library.

Enjoy!



Tuesday, July 14, 2015

Naive Bayes classifier for English words / generic drugs

Looking to post a full-text medical search application, ready to deploy, soon. In the meantime ...

Brushing off some skills and piggybacking on an article here, I thought I'd show my fellow clinicians a quick use of NLTK's naive Bayes classifier for classifying a word as either an English word or a generic drug.

A good description of the concept of a naive Bayes classifier is here.
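
Before the code, a toy illustration of the idea with made-up numbers: suppose the trigram "ine" shows up in 5% of generic drug names but only 1% of English words, and we start with even priors. One application of Bayes' rule already tips the scales heavily:

# Toy Bayes update with made-up numbers (illustration only).
p_ngram_given_drug = 0.05  # hypothetical P("ine" | genericDrug)
p_ngram_given_eng = 0.01   # hypothetical P("ine" | engWord)
prior_drug = prior_eng = 0.5

posterior_drug = (p_ngram_given_drug * prior_drug) / (
    p_ngram_given_drug * prior_drug + p_ngram_given_eng * prior_eng)
print posterior_drug  # ~0.83 in favor of genericDrug, from one feature

The naive Bayes classifier does this for every feature, naively assuming they're independent of each other.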

From a fresh directory for this small one-off code example (make the data subdirectory first):

mkdir -p ./data
cat /usr/share/dict/words > ./data/words.txt

Download the DrugBank database, freely available here, and unzip it into the data directory (the script below reads ./data/drugbank.xml).

Both of the files below can be copied from this gist, here and here.

Create a file called parse_drugbank.py with:

# Source: parse_drugbank.py
# -*- coding: utf-8 -*-
import os
import xml.sax

DATA_DIR = "./data/"

class DrugXmlContentHandler(xml.sax.ContentHandler):
    
    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.tags = []
        self.generic_names = []
        self.brand_names = []
        
    def startElement(self, name, attrs):
        self.tags.append(name)
    
    def endElement(self, name):
        self.tags.pop()
        
    def characters(self, content):
        # Build a breadcrumb like "drugbank/drug/name" from the stack of
        # open tags, so we only capture text from the elements we want.
        breadcrumb = "/".join(self.tags)
        #if breadcrumb == "drugbank/drug/products/product/name":
        #    self.brand_names.append(content)
        if breadcrumb == "drugbank/drug/name":
            self.generic_names.append(content)
    
def write_list_to_file(lst, filename):
    fout = open(os.path.join(DATA_DIR, filename), 'wb')
    for e in lst:
        fout.write("%s\n" % (e.encode("utf-8")))
    fout.close()

    
source = open(os.path.join(DATA_DIR, "drugbank.xml"), 'rb')
handler = DrugXmlContentHandler()
xml.sax.parse(source, handler)
source.close()

write_list_to_file(handler.generic_names, "generic_names.txt")
#write_list_to_file(handler.brand_names, "brand_names.txt")

Run it: python parse_drugbank.py. It writes the generic names to ./data/generic_names.txt.

Create a file called wordDrugClassify.py with the following, after pip installing the packages imported at the top (matplotlib, nltk, numpy):

# Source: wordDrugClassify.py
# -*- coding: utf-8 -*-
# Thanks to Sujit Pal at sujitpal.blogspot.com
import matplotlib.pyplot as plt
import nltk
import numpy as np
import os
import string
from operator import itemgetter
import random

GRAM_SIZE = 3

def word2ngrams(text, n=3, exact=True):
   """ Convert text into character ngrams. """
   return ["".join(j) for j in zip(*[text[i:] for i in range(n)])]

def is_punct(c):
    return c in PUNCTS
    
def is_number(c):
    return c in NUMBERS
    
PUNCTS = set([c for c in string.punctuation])
NUMBERS = set([c for c in "0123456789"])

def str_to_ngrams(instring, gram_size):
    ngrams = []
    for word in nltk.word_tokenize(instring.lower()):
        try:
            word = "".join(["S", word, "E"]).encode("utf-8")
            cword = [c for c in word if not(is_punct(c) or is_number(c))]
            ngrams.extend(["".join(x) for x in nltk.ngrams(cword, gram_size)])
        except UnicodeDecodeError:
            pass
    return ngrams

def ngram_distrib(words, gram_size):
    tokens = []
    for word in words:
        tokens.extend(str_to_ngrams(word, gram_size))
    return nltk.FreqDist(tokens)
    
def plot_ngram_distrib(fd, nbest, title, gram_size):
    kvs = sorted([(k, fd[k]) for k in fd], key=itemgetter(1), reverse=True)[0:nbest]
    ks = [k for k, v in kvs]
    vs = [v for k, v in kvs]
    plt.plot(np.arange(nbest), vs)
    plt.xticks(np.arange(nbest), ks, rotation="90")
    plt.title("%d-gram frequency for %s names (Top %d)" % 
              (gram_size, title, nbest))
    plt.xlabel("%d-grams" % (gram_size))
    plt.ylabel("Frequency")
    plt.show()
   
###

with open("./data/words.txt") as f:
   eng_words = f.read().split() 

with open("./data/generic_names.txt") as f:
   generic_names = f.read().split()
   
eng = ngram_distrib(eng_words, GRAM_SIZE)
generic = ngram_distrib(generic_names, GRAM_SIZE)

plot_ngram_distrib(eng, 30, "Eng words", GRAM_SIZE)
plot_ngram_distrib(generic, 30, "Generic drugs", GRAM_SIZE)

###

words = ([(word, 'engWord') for word in eng_words] +
   [(word, 'genericDrug') for word in generic_names])
random.shuffle(words)

###

train_words = words[1500:]
devtest_words = words[500:1500]
test_words = words[:500]

###

def word_features(word):
   features = {}

   i = 0
   for ngram in str_to_ngrams(word, 3):
       features["ngram" + str(i)] = ngram
       i += 1

   return features

###

train_set = [(word_features(w), c) for (w,c) in train_words]
devtest_set = [(word_features(w), c) for (w,c) in devtest_words]
test_set = [(word_features(w), c) for (w,c) in test_words]

classifier = nltk.NaiveBayesClassifier.train(train_set)

###

errors = []
for (word, tag) in devtest_words:
    guess = classifier.classify(word_features(word))
    if guess != tag:
        errors.append( (tag, guess, word) )
        
for (tag, guess, word) in sorted(errors):
    print 'correct=%-8s guess=%-8s name=%-30s' % (tag, guess, word)

###

if __name__ == '__main__':
   while True:
       print
       print "% accurate: ",
       print nltk.classify.accuracy(classifier, devtest_set)
       print "Hit enter with no input to quit."

       query = raw_input("Query:")
       if query == '':
           break
       else:
           print classifier.classify(word_features(query))
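
To try it: python wordDrugClassify.py. Note that the two plt.show() calls block until you close each plot window; after that, training runs and you get the interactive query prompt.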

As you can see, the error analysis shows us >96% accuracy on our dev-test set... this example doesn't even bother moving on to the held-out test set.

With just some simple features, here the character n-grams, we were able to predict with a high degree of accuracy whether a word was a generic drug or an English word.
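
To make "simple features" concrete, here is what the extractor above actually produces for one word (the S and E characters mark word start and end):

>>> str_to_ngrams("aspirin", 3)
['Sas', 'asp', 'spi', 'pir', 'iri', 'rin', 'inE']

word_features then turns that list into a feature dict (ngram0='Sas', ngram1='asp', ...) for the classifier.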

--JG


Tuesday, June 9, 2015

NER, etc

As an initial post, I will refer the reader to a small open-source project I published to GitHub recently (https://github.com/jtgreen/SMPP). It's fairly self-explanatory from the readme.md, though a discussion of thought process, direction, or use is welcome here.

JG

Welcome!

Hi. I'm a physician and I've also been programming for 20+ years. Medicine can benefit from computers. Period.

If you ever have any questions feel free to reach out here on the blog. 

-- Best,

John T Green