Solr is one of the most common search platforms, built on the powerful indexing platform Lucene. The main feature of Solr is high-speed, scalable text search. It's most thought of as Google-like search (e.g. find me all the documents that mention 'cancer'). In addition, it supports highlighting, faceted search, spelling and fuzziness approximation, and 'more like this' searches. It supports many common data types like CSV, XML, text, Microsoft Word and PDF.
Both Solr and Lucene are written in Java, but provide Rest-like HTTP interfaces and also native support in JVM languages, Python, Ruby, PHP, C++, and others.
Solr can be scaled to large clusters to using Zookeeper, but this discussion is focused on the basic features of Solr using a single instance.
To run Solr, you'll need at least to be running Java 8. After that, download and decompress the tar file.
Navigating to the directory where Solr is decompressed, execute the following command.
By default Solr runs on 8983. Verify Solr is running by visiting the Solr Dashboard in your browser.
As mentioned above, Solr can run as single instance or as a cluster. It can be ran alongside Hadoop and is often distributed with major Hadoop installations. Solr can be used to store documents, or just used for indexing, referencing documents in another data store such as HBase or Cassandra.
The Solr Dashboard provides some metrics about your installation, logging, configuration, and administration of cores. Most Solr installations will have a custom client application built on top of Solr, with a typical (simplified) installation looking something like this.
Part of Solr's power is being able to pass it a block of text, and its ability to quickly find a specific word mentioned in that text. For that reason, it supports very flexible schemas and data types, with very few constraints. Those constraints are, each document much have a unique identifier, and each schema must have a default search field.
In additional to static fields, Solr supports dynamic fields that can be defined at index time. This allows for future-proofing data sets as new features are added.
To get going with Solr, we need to setup a Solr core. Think of a core like any data store, with one schema, storing a set of documents. A Solr installation can have multiple cores, if needed.
From the Solr home, execute the following command.
This will setup the basic config and location of our documents, but we still need to define our schema.
Although Solr provides many fields, we may want to add custom fields to tell Solr how to tokenize and store text data. In this case, we split on white space, so we can search for individual tokens easily.
Run the following:
Run the following commands to add fields to our new core.
We have some basic metadata fields for id, type, and our default search field is
Indexing is the way to get documents into Solr. Solr provides some native POST tools to load documents in Solr, for JSON, XML and CSV. Many of the native language Solr libraries also support indexing such as SolrJ. For this example, will run the following command to index a subset of the FDA Drug Label set into Solr.
See more information about indexing here.
Searching is the most common feature of Solr. Solr can be searched via REST, CURL, wget, etc. The Solr admin interface is the easiest way to begin searching.
The default query syntax is
*:* returns all documents. If you don't specify a field, Solr will search all fields. You can query using the admin tool, or using something like curl, e.g.
Solr supports different query parsers, but the default parser can support things like:
curl "http://localhost:8983/solr/reports/select?indent=on&q=vomiting&wt=json"(any mention of 'vomiting')
curl "http://localhost:8983/solr/reports/select?indent=on&q=%22birth%20defects%22&wt=json"(any mention of the phrase "birth defects")
curl "http://localhost:8983/solr/reports/select?indent=on&q=seizure%20NOT%20uncontrolled&wt=json"(mentions of 'seizure' as long as 'uncontrolled' is not present)
curl "http://localhost:8983/solr/reports/select?indent=on&q=seizure%20AND%20(%22without%20epilepsy%22%20OR%20%22no%20epilepsy%22%20OR%20%22not%20having%20epilepsy%22)&wt=json"(any mention of seizure, where at least one of the following phrases in present "without epilepsy", "no epilepsy", "not having epilepsy")
curl "http://localhost:8983/solr/reports/select?indent=on&q=%22mood%20changes%22~3&wt=json"(The word "mood" within 3 words of "changes")
curl "http://localhost:8983/solr/reports/select?indent=on&q=valproate~0.8&wt=json"(all spelling variations of "valproate" with a threshold of 0.8, scale is 0-1)
By default, a query will return the first 10 rows, but you can tweak that, using the
rows (how many to return) and
start (how far to offset the results - default 0).
All documents will return by default, in score order, basically how relevant a document is to the query. Lucene uses a Vector Space Model and a Boolean Model to determine how similar the document is to the query.
wt field is the format the data will return, the default is JSON, however you can use any of the following:
If you just want to see a limited number of fields, you can use the
fl parameter to limit the field set.
You can use the same fields in the filter query as the default query, however filter queries don't use scoring, so they may be faster and more useful in limiting result sets. Filter queries can be similar to a
WHERE clause predicates in SQL.
Solr supports many other IR and text tools other than search. They include faceting, highlighting, term vectors and term frequency, spelling suggestions, and more like this, etc. We'll cover a couple of these in depth. See a full list of all the Solr components here.
Faceting is popular aspect of Solr searching to bucket or categorize your Solr data. It's basically way to see how many unique documents match the query, by category. For example, in our data set, it's useful to see all the unique drugs and how many of each there are.
We can run the following:
We can also facet with ranges (in numeric and date fields), or use facet pivots for a decision tree like result.
The Terms component is useful to see the most common tokens in your schema, and you can limit to a specific field or range of terms. It can be useful for making suggestions to users, but it may also benefit from data cleaning, such as stop word removal, e.g.
The Term Vector component can be enabled on a field by setting
termVectors=true. It can return the term frequency, inverse document frequency, position and offset information. Term frequency-inverse document frequency or tf-idf is a common statistic used in IR, basically weighting the term to how many times it appears in a particular documents, versus the frequency of the term in the entire document set. It helps identify the importance of terms, and offsets common words like 'the', 'at', 'her', etc.
Highlights the document fragments that match the query
Recommends spelling suggestions based on the Solr index
Queries for similar documents to the search results.
SolrJ is Java API that simplifies some of the interactions with Solr, abstracting it to higher level methods. You can index with SolrJ as well. Here's a simple example in Scala.
So far, Solr is proven to be a very powerful tool for search and information retrieval. However to do real data discovery, we may need to apply other NLP techniques, such as named entity recognition (NER), or tagging text as different semantic classes. Cases in which we need this kind of deeper analysis, include:
cTAKES is an open source modular tool built out of Mayo, using Java. It's built on an Apache UIMA (Unstructured Information Management Architecture). Core cTakes components include:
Determine what components you need for your project, and setup an Aggregator, or use the pre-built ones
Or write your own cTAKES annotator
Of course many of these tasks can be done with other NLP libraries like NLTK or spaCy. These are probably faster to get started, however, they don't ship with pre-built clinical models like cTAKES.
Only do this with CAUTION. Solr will delete all the documents matching your query.
You can also delete with a post command.
You can stop Solr by running:
Use facet queries to find the most common manufacturer (organization_attr).
What are the most common drugs (subject) by this manufacturer? Look at using a filter query (fq) for limiting your search. Filter queries can improve speed and simplify your principal query (q).
Facet pivots provide a decision-tree like result to return multiple facets. How do you use facets to find the most common drugs and manufacturers?
Look at the terms handler in Solr. How do you return the top 250 terms in the warnings_section?
What common NLP techniques might be used to reduce this number to a more meaningful list?
Create a new document core using
drug_labels.json. What dynamic fields can be added an extracted out?
How do we add dynamic fields to our schema?
What is the command to use the POST tool with JSON?
EXTRA: Index data from the original source documents. Write a script to map the original fields to your Solr fields. (See sample source document
levetiracetam.xml, what additional fields would you want to capture in your index?)