Saturday, August 15, 2015

Osler's Library (oslersLibrary), a free, web-hostable full-text medical search tool.

We are going to do some information retrieval in the medical domain. This will be a long and ongoing project with many experiments. I am no expert in information retrieval (though as a physician I know what right answers look like in the medical domain), so we will learn together and produce something useful to the clinician: a complete product.

There is a very large trove of medical PDFs in circulation, some legal and some not (medical book PDFs are available as torrents; I am not endorsing this, but they exist and many people use them). However you acquire your own medical library, it could likely benefit from free-text search. That is the thrust of the first part of this project. Subsequent parts will expand the product by turning cutting-edge informatics research articles into new features (for example, random walks over the UMLS for automatic query expansion).

Why create our own medical library? To write our own algorithms, of course (though speed, selection, and accessibility matter too, as do control and privacy).

So this first article establishes our baseline: a shell, though a fully functional and useful one; essentially an HTML wrapper around Elasticsearch.

Bird's-eye view of the tech:
Full-text search on backend (elasticsearch)
html5/jquery/django frontend with authentication
python as the language of choice
python-elasticsearch for our elasticsearch bindings
recommend apache2/wsgi for deployment (not covered, beyond scope)

Choice 1, full text search
When it comes to full-text search, one of my major decision points was Solr vs Elasticsearch. Honestly, I had only used Solr before, but after sniffing around Elasticsearch for a while I was hooked. I'm not saying it was an exhaustive search, but what appealed to me most was the easy setup, the lightning speed, the vibrant community, and the JSON-driven RESTful API.
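
To give a feel for that JSON-driven API, here is a minimal sketch of the kind of query body a page search might send. The index name "library", the field name "text", and both helper names are assumptions made for illustration, not names taken from the oslersLibrary code. The client `es` is passed in by the caller and is assumed to be an Elasticsearch client from the python-elasticsearch package.

```python
def build_page_query(text, size=10):
    """Build a full-text query body as a plain dict (it is sent as JSON)."""
    return {
        "size": size,
        # "text" is a hypothetical field name holding the page contents
        "query": {"match": {"text": text}},
        # ask Elasticsearch to return highlighted snippets for that field
        "highlight": {"fields": {"text": {}}},
    }


def search_pages(es, text, index="library"):
    """Run the query with a client supplied by the caller.

    `es` is assumed to be an elasticsearch.Elasticsearch instance;
    "library" is a hypothetical index name.
    """
    response = es.search(index=index, body=build_page_query(text))
    # each hit carries the stored document under "_source"
    return [hit["_source"] for hit in response["hits"]["hits"]]
```

With Elasticsearch running locally and pages indexed under a "text" field, calling search_pages with a client and a phrase such as "aortic stenosis" would return the stored source of each matching page.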

Choice 2, python, django, bootstrap 
I'm comfortable in Java, but Python is so fast to write and Django is so production-ready and rapid that this was an easy choice. Web apps, to me, seem to be the new GUI platform ... and unless you need full-on App Store style apps (in which case you're going to have to work through Objective-C, which is doable, but you're likely to still need a web interface), the web GUI has the double edge of giving you a web presence as well as iPhone/iPad friendliness (when you use a responsive grid system like Bootstrap).

So, in summary, this is effectively a semi-production-capable (home production, anyway) pretty web GUI with authentication, wrapped around Elasticsearch, a world-class search engine employed across industry. It gives us a great starting point for running experiments by adding features to a clinic/hospital-accessible medical library.

One quick note: the design of this was fast and dirty. I patched up some holes for home production, but things like unit testing, a cornerstone of good development, went out the window. If anyone wants to write them for me, please, by all means... but I'm on call every other day and I've got to do what I can when I can.

Install instructions (you need some skills, doctor):

Download and install Elasticsearch (the commands below assume the elasticsearch-1.6.0.tar.gz archive is already in /usr/local):

On Mac:
cd /usr/local 
sudo tar xvzf elasticsearch-1.6.0.tar.gz
cd elasticsearch-1.6.0

On Ubuntu 14: 
cd /usr/local 
sudo tar xvzf elasticsearch-1.6.0.tar.gz
cd elasticsearch-1.6.0

On Ubuntu, there is a good script here and here that sets it up as a service.

Use virtualenv. If you're unfamiliar with it, see here. Also, we use Django; read about it here and here.

Pip install the following packages (inside the virtualenv, sudo should not be needed):
Django==1.8.2
PyPDF2==1.24
elasticsearch==1.5.0
urllib3==1.10.4
uuid==1.30
wsgiref==0.1.2

Download the project from my github:
git clone https://github.com/jtgreen/oslersLibrary.git

Rename settings.py_scrubbed to settings.py under oslersLibrary.

Generate a new secret key, and paste it between the single quotes under SECRET_KEY:
import random
print(''.join([random.SystemRandom().choice('abcdefghijklmnopqrstuvwxyz0123456789!@#$%^&*(-_=+)') for i in range(50)]))

I deleted the local SQLite db, which is just used for accounts and sessions, so recreate it. With the virtualenv active and inside the directory "oslersLibrary":

python manage.py migrate

You should see:

Operations to perform:
  Apply all migrations: admin, contenttypes, auth, sessions
Running migrations:
  Applying contenttypes.0001_initial... OK
  Applying auth.0001_initial... OK
  Applying admin.0001_initial... OK
  Applying sessions.0001_initial... OK

Now let's import some data. The data is stored in the data dir under oslersLibrary's static dir, so mkdir static/data and copy all your PDFs into it. Every PDF that the script we're about to run finds will be split into pages, OCR'ed, and fed into Elasticsearch, and the intermediate files will then be cleaned up. This is CPU intensive!

The results are placed in a per-file directory: for a PDF named my_file in static/data, there will be a directory static/data/my_file holding the components the app needs.
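
As a rough sketch of that layout, assume each PDF gets a directory named after the file without its extension, and each page becomes one Elasticsearch document. The helper names, the "library" index, the "page" type, and the field names below are all hypothetical and are not taken from the project's import script. The action dicts follow the shape that elasticsearch.helpers.bulk accepts.

```python
import os


def book_dir_for(pdf_path):
    """Map static/data/my_file.pdf to static/data/my_file."""
    root, _ext = os.path.splitext(pdf_path)
    return root


def page_actions(pdf_path, page_texts, index="library"):
    """Yield one bulk-index action per page of already-extracted text.

    `page_texts` is a list of strings, one per page, in page order.
    """
    book = os.path.basename(book_dir_for(pdf_path))
    for number, text in enumerate(page_texts, start=1):
        yield {
            "_index": index,                    # hypothetical index name
            "_type": "page",                    # hypothetical document type
            "_id": "%s-%d" % (book, number),    # e.g. my_file-2
            "_source": {"book": book, "page": number, "text": text},
        }
```

Keeping the page number in each document is a design choice for this sketch: it would let a search hit link straight to the matching page rather than to the whole book.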

Eventually I'll add a drag and drop interface to the web GUI.

Eventually I'll have time for some good programming practices like unit tests :-/

One last thing, create an admin to log in with:

python manage.py createsuperuser

You can now log in to the library with this account. You can create new users at the /admin endpoint.

That's it! 

Run the Django development server like so:

python manage.py runserver

or if you want to specify a port or access from another computer (not localhost):

python manage.py runserver 0.0.0.0:8000

Deployment is beyond the scope of this blog post. Good articles can be found easily with Google. I recommend Apache2 with WSGI.

Questions for help or comments for improvement are welcome!


Again, remember... this is a laboratory of sorts. A launchpad for future experiments AND a useful tool to run from your home server (I recommend dyndns) to access your personal medical library.

Enjoy!


