Build great web search applications quickly with Solr and Blacklight

Ron DuPlain*a, Dana S. Balsera, Nicole M. Radziwillb

aNational Radio Astronomy Observatory, 520 Edgemont Rd, Charlottesville, VA, USA 22903-2475;
bJames Madison University, 800 S. Main Street, Harrisonburg, VA, USA 22807

ABSTRACT

The NRAO faced performance and usability issues after releasing a single-search-box ("Google-like") web application to query data across all NRAO telescope archives. Running queries with several relations across multiple databases proved to be very expensive in compute resources. An investigation into a better platform led to Solr and Blacklight, a solution stack which allows in-house development to focus on in-house problems. Solr is an Apache project built on Lucene to provide a modern search server with a rich set of features and impressive performance. Blacklight is a web user interface (UI) for Solr, primarily developed by libraries at the University of Virginia and Stanford University. Though Blacklight targets libraries, it is highly adaptable for many types of search applications which benefit from the faceted searching and browsing, minimal configuration, and flexible query parsing of Solr and Lucene. The result: one highly reused codebase provides millisecond response times and a flexible UI. NRAO is rolling out Solr and Blacklight not just for observational data but also across library databases, telescope proposals, and more, in addition to telescope data products, where integration with the Virtual Observatory is ongoing.

Keywords: search engine, enterprise search server, Solr, Lucene, Blacklight, telescope data archive, Java, Ruby on Rails, code reuse, open source
1. INTRODUCTION

The National Radio Astronomy Observatory (NRAO) has a lot of data, including: observational data from telescopes and surveys, proposals accepted for telescope time, proceedings from various conferences, series of memos, bibliographic records of papers written by NRAO staff, and bibliographic records of papers written using NRAO data, instruments, and surveys. NRAO has been exploring techniques for building modern web-based search applications to put these data online and to improve existing search interfaces. Historically, developers at NRAO (and likely elsewhere) have built query parsers in-house for relational databases. Even with early success, custom query parsers require a significant investment to build and maintain, especially in comparison to the advanced open source search tools which are readily available and have active communities. In 2009, software projects at NRAO began indexing data in Solr, built on Lucene; both are open source products backed by the Apache Software Foundation. Blacklight, another open source product, provides the user interface (UI), resulting in a complete development framework for building custom search applications.

The goals of this paper are to 1) describe how NRAO is building new web-based search applications, 2) convey the performance and flexibility of Lucene, Solr, and Blacklight in building custom search applications, and 3) illustrate these points by building a small but complete search application from scratch within the text of this paper.
[Figure 1 shows a block diagram: an Apache httpd web server with mod_rails fronts the Blacklight web UI (Ruby on Rails), exchanging HTTP requests and responses with users; Blacklight relays search queries to Solr (built on Lucene, in Java), which answers from an index of data drawn from a database, flat files, or elsewhere.]

Figure 1. Interactions between various systems to provide a production web-based search application.
[email protected]; phone +1 434-244-6845; www.nrao.edu
2. SELECTED DATA COLLECTIONS AT NRAO

As an observatory, NRAO has many data collections which warrant online search interfaces. The NRAO Office of End to End Operations and the NRAO Library regularly receive requests for access (or for improvements to existing access) to the following data collections. These data collections are independent of each other and collectively use a wide variety of database tools and flat file formats.

2.1 Telescope data

When observers collect data on NRAO telescopes, they have a proprietary period (usually twelve months) to process and publish results. When the proprietary period ends, the data become public and are often useful for archival research. Since 2003, NRAO has been developing an online tool to search and access both public and proprietary data, the latter requiring authentication. With three currently active public telescopes and one under development, the current search tool only includes data from two telescopes. The data from the telescopes are a collection of stand-alone files formatted in FITS,1 with metadata managed in two different relational database management systems: Oracle and PostgreSQL.

2.2 Telescope proposals

NRAO operates public instruments and issues three calls for proposals throughout the year. Once accepted, a proposal coversheet circulates to telescope and observatory operations as needed. An electronic submission tool accepts new proposals and stores recent proposals in MySQL. NRAO has no central database for historical proposals, and currently there is no public portal to access proposals online.

2.3 NRAO papers

The NRAO library maintains a collection of bibliographic records for publications which use data from NRAO telescopes or have authors who are NRAO staff, scientific or technical. NRAO has been developing a public search interface in PHP and a separate administrative interface for library use only. These bibliographic records currently reside in MySQL.

2.4 NRAO theses database

The NRAO library maintains a collection of bibliographic records relating to graduate theses written using NRAO instruments. This database has migrated several times in the past four years and currently resides in Microsoft Access.

2.5 ISSTT proceedings

The NRAO library manages the collection of papers from all proceedings of the International Symposium on Space Terahertz Technology (ISSTT). NRAO is digitizing and hosting all of the proceedings to date, totaling more than 20 years of meetings. The proceedings are a collection of scanned-in PDFs, with metadata in MySQL.

2.6 Other

NRAO is also considering search tools for various other collections, such as technical memos and meeting minutes.
3. LUCENE, SOLR, AND BLACKLIGHT

Lucene provides a base for advanced search indexing, which Solr packages into a standalone server with a RESTful API, to which Blacklight provides a web user interface. Figure 1 provides an example overview of these systems in action.

3.1 Lucene

Lucene Java is a Java library for adding powerful search indexing to an application. Lucene provides advanced indexing, query parsing, tokenization, spellchecking, and hit highlighting.2

3.2 Solr

Solr is a stand-alone search server built on Lucene; developers need no knowledge of Lucene Java itself for most use cases.3 Customizing Solr is often just a matter of managing XML configuration files. Solr runs in a Java servlet container, listening for queries over HTTP using a REST pattern. Solr responds by default in XML, supports common formats such as JSON, Python, and Ruby out of the box, and provides hooks for custom response formats. Solr accepts updates to its index in a variety of ways, including (but not limited to) XML or JSON over HTTP, CSV over HTTP or on a local filesystem, native Java interfaces, or any one of Solr’s built-in data import handlers.4
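For example, a single document can be added to the index as XML over HTTP. The sketch below assumes a Solr server on its default port with fields named id and title in its schema:

$ curl 'http://localhost:8983/solr/update?commit=true' \
    -H 'Content-Type: text/xml' \
    --data-binary '<add><doc>
        <field name="id">doc-1</field>
        <field name="title">An example document</field>
      </doc></add>'

Solr replies with a status of 0 on success, and the commit=true parameter makes the new document immediately visible to queries.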
3.3 Blacklight

Blacklight is an open source Ruby on Rails project which provides a web user interface to Solr.5 The project has an actively growing community, led by libraries at the University of Virginia and Stanford University. Blacklight’s original design provides an Online Public Access Catalog (OPAC) out of the box, which works with library catalog records such as those in the machine-readable catalog (MARC) format. Blacklight applies generically, not just to catalog records; it uses a simple configuration file to match any Solr† schema. Customizing the Blacklight user interface (UI) is just like customizing any Ruby on Rails project.6 Developers can build complete applications with branded UIs with just a few edits to the Blacklight configuration file and to project templates and style sheets. Extensive customization and integration with other systems are both possible with the Ruby on Rails framework and are not specific to Blacklight. Stanford University has a Blacklight OPAC in production which demonstrates what Blacklight was originally intended to be: an Online Public Access Catalog.8
Figure 2. Stanford University’s SearchWorks, a Blacklight application serving as an OPAC. Users enter queries in the single search box at the top of the page and limit searches using facet links at the left of the page. As a Ruby on Rails project, Blacklight is customizable; Stanford provides a tag cloud and recent news on the front page.

NRAO has been developing with Blacklight since mid-2009 and launched two applications in early 2010: the NRAO Theses Database9 and ISSTT Proceedings10. Development is ongoing on search tools for telescope data, telescope proposals, and the wider collection of papers relating to NRAO. Section 2 describes these data collections.
† Since Solr provides a useful abstraction of Lucene, this paper refers to Solr and Lucene as simply “Solr.”
Figure 3. NRAO’s first Blacklight application,9 a simple yet complete search tool querying across all graduate theses written using NRAO instruments or surveys.
4. AN ITERATIVE APPROACH TO BUILDING A SEARCH TOOL

NRAO’s process has three distinct phases in building search applications with Solr and Blacklight, followed by a deployment phase.

4.1 Phase I: data definition

A search tool requires its designer to know the records and fields in a data collection: which fields are for query, which are for retrieval, and which are for both. Well-defined data are data which a document can describe with minimal ambiguity; that is, well-defined data have a schema (whether or not that schema is already written down). The purpose of this phase is to understand that schema, perhaps modify it according to search use cases, and to clarify the policies which govern the corresponding data. For familiar data, this phase takes minutes; otherwise, it can take hours or days. If policies on the use of data are not clear, development of a search tool is likely to get stuck in this phase.

4.2 Phase II: initialization of Solr and Blacklight

This phase starts with installation of Blacklight‡ and Solr and their required dependencies, and continues with configuration. Solr has an XML configuration file named schema.xml which specifies input data using a variety of types, including (but not limited to) character strings, text, Booleans, integers, floating point numbers, dates, and custom types. Blacklight has a simple configuration file called solr.yml to point to the correct Solr server and a Ruby configuration file called blacklight_config.rb to match the Solr server’s schema and humanize its fields. With a configuration in place, the next step is to ingest the data. Whatever the input, Solr tracks individual records simply called documents. If the input data are in a relational database, an early step is to denormalize the database. An example of denormalization is to pack all fields in a relational database into a single table (see the sketch below). Denormalization is not a best practice in relational database design, but it is a best practice in indexing for search. Input data need not come from a relational database; input data can come from anywhere, as long as they fit Solr’s document model matching the schema which schema.xml specifies. This phase concludes with a functional prototype of Blacklight hitting a live Solr server with an index of relevant data.
‡ As of version 2.4, the Blacklight installer also installs Solr. Blacklight users may not need to install Solr separately.
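The following minimal Python sketch illustrates the idea of denormalization, using hypothetical author and paper tables; each joined, flattened dictionary is exactly the kind of document Solr ingests:

# Hypothetical normalized data: an author table keyed by ID,
# and a paper table referencing authors by that ID.
authors = {1: 'Smith, J.', 2: 'Jones, A.'}
papers = [
    {'title': 'An Example Survey', 'author_id': 1},
    {'title': 'Another Example', 'author_id': 2},
]

# Denormalize: resolve the join and pack each record into one
# flat dictionary, the shape of a Solr document.
docs = [{'title': paper['title'], 'author': authors[paper['author_id']]}
        for paper in papers]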
4.3 Phase III: customization and tuning

This phase involves customizing the web UI like any other Ruby on Rails project and tuning search handling and relevance in Solr’s solrconfig.xml, an XML configuration file which specifies search handlers and other Solr parameters. A first pass through this phase would, for example, adjust fonts, page widths, and other style elements, and change the colors and logos of Blacklight to match those of the organization. Solr defaults are often good enough for early development. This phase is inherently creative and as such is open-ended.

4.4 Deploy

The development servers for Blacklight and Solr provide a means to interact with the latest version of the search application, but they are not suitable for production environments. This part of the iteration releases the latest version of the search application for regular use. Section 7 provides further discussion of deployment.

4.5 Repeat

A functional search tool provides insight into improving a data schema, building a better search index, tuning search relevance, and prioritizing UI features. That is, a functional search tool provides insight into building a better search tool; building a search tool is an iterative process. Solr and Blacklight enable fast iteration, since configuration files add, remove, or tune many features. For various applications at NRAO, cycles to build new application versions were often shorter than a day, and in some cases an hour or less. To support this claim, the remaining sections provide the commands, code, and configuration which build a fully functional search application from scratch.
5. SEARCHING DATA WITH SOLR

To illustrate unambiguously how to use Lucene, Solr, and Blacklight to build a custom search application, this section and the remaining sections provide instructions for building an example application on a host at example.org with a freshly installed instance of Ubuntu 10.04 (Lucid Lynx), i386 server edition.§ The commands in this section and the following section require two terminal sessions on the development host. The command line here uses the bash shell with the prompt:

$

5.1 Installing dependencies for Solr

Solr requires Java. For this fresh installation of Ubuntu, the package manager apt-get installs a Java Runtime Environment and the curl utility, which is useful for interacting with web services from the command line.

$ sudo apt-get update
$ sudo apt-get install openjdk-6-jre-headless curl
5.2 Downloading, unpacking, and running Solr

Solr is available for download from one of Apache’s mirrors listed on the download page.3

$ wget http://apache.osuosl.org/lucene/solr/1.4.0/apache-solr-1.4.0.tgz
$ tar -xvzf apache-solr-1.4.0.tgz
After unpacking, Solr’s example package demonstrates whether everything is working.§

$ cd apache-solr-1.4.0/example
$ java -jar start.jar    # generates a lot of output
...
... Started SocketConnector @ 0.0.0.0:8983
§ Things change. Readers should follow this example at their own risk.
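Before any schema changes, a quick query against the (still empty) example index confirms the server responds; this check assumes the default port above:

$ curl 'http://localhost:8983/solr/select/?q=*:*'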
5.3 Defining a schema

Solr has a tutorial in its documentation, complete with document interaction.13 This example instead uses data from data.gov on NSF Graduate Research Fellowship Program Award Recipients.14 Data on this award are available back to 2001 on data.gov; the code here uses 2009 as an example. The download location redirects to GRFP_Award_2009.csv.

$ wget http://www.data.gov/download/2018/csv
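The loading script in the next subsection refers to the CSV columns by name (Name, Field of Study, Current Institution, and so on); inspecting the header row first confirms the field names:

$ head -1 GRFP_Award_2009.csv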
Solr’s schema provides for dynamic fields, which match field names on simple wildcards. Dynamic fields are useful in prototyping exercises where field names and types might change throughout the course of development, and they are acceptable for production use, too. Adjustments along the following lines to Solr’s example schema in solr/conf/schema.xml (between <fields> and </fields>) provide a working schema for the award recipient data:

   <field name="id" type="string" indexed="true" stored="true" required="true"/>
   <field name="text" type="text" indexed="true" stored="false" multiValued="true"/>
   <dynamicField name="*_s" type="string" indexed="true" stored="true"/>
   <dynamicField name="*" type="text" indexed="true" stored="true"/>

Below the fields block, id serves as the unique key, text as the default search field, and a copyField directive gathers every field into text for searching:

   <copyField source="*" dest="text"/>
   <uniqueKey>id</uniqueKey>
   <defaultSearchField>text</defaultSearchField>
A restart of the Solr server will load the new schema.

$ java -jar start.jar
...
... Started SocketConnector @ 0.0.0.0:8983
5.4 Loading data into Solr

Most programming languages provide a means to interact with a Solr server once it is up and running. The Python programming language provides such an environment, using a third-party package called pysolr. Python has a tool called pip to install third-party packages, and apt-get provides pip.

$ sudo apt-get install python-pip
$ sudo pip install distribute
$ sudo pip install pysolr
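A one-line check confirms that pysolr can reach the server; with the schema loaded but no data ingested yet, it should report zero hits (this assumes the Solr server above is still running):

$ python -c "import pysolr; print pysolr.Solr('http://localhost:8983/solr').search('*:*').hits"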
A quick Python script massages the input data and adds it to the running Solr server. Python provides a useful mapping to Solr’s index: the pysolr.Solr.add method accepts a list of Python dictionaries with keyword-value pairs whose values are of limited complexity (mostly primitives, though types like datetime work, too). Solr accepts multiple values per key if the multiValued attribute is set in the schema. If a script can fit data into a list of dictionaries in Python, it can load those data into Solr.

import csv

import pysolr

# Connect to the Solr server.
solr = pysolr.Solr('http://localhost:8983/solr')

# Note: CSV file does not include an ID field, so create one.
# id format: yyyyxxxx
#   yyyy is the 4-digit year
#   xxxx is an auto-increment counter
current_id = 20090001

for record in csv.DictReader(open('GRFP_Award_2009.csv')):
    doc = {}
    # File has Latin1 encoding, not ASCII. Decode each field.
    for key, value in record.items():
        record[key] = value.decode('latin1')
    doc['name'] = record['Name']
    doc['field'] = record['Field of Study']
    doc['institution'] = record['Current Institution']
    doc['undergrad'] = record['Baccalaureate Institution']
    doc['grad'] = record['Graduate Institution']
    doc['email'] = record['Email Address']
    # Copy each field to field_s, for indexing as string.
    for key, value in doc.items():
        doc[key + '_s'] = value
    # Create the ID field.
    doc['id'] = current_id
    current_id += 1
    # Add the document, but don't commit all documents until later.
    solr.add([doc], commit=False)

# Commit all documents to the index.
solr.commit()
With Solr running, calling Python on this script uploads the 1248 records in the input CSV in a matter of seconds. The curl command provides a quick way to verify that data are loaded into the Solr schema. Solr’s output to the terminal is essential in understanding errors in this step. When this script completes, Solr has data in its index and is ready to go.

$ python awards2009.py
$ curl 'http://localhost:8983/solr/select/?q=systems+engineering&indent=on&wt=json'
{
 "responseHeader":{
  "status":0,
  "QTime":4,
  "params":{
        "indent":"on",
        "wt":"json",
        "q":"systems engineering"}},
 "response":{"numFound":2,"start":0,"docs":[
        {
         "name":"Donohoo, Pearl",
         "undergrad":"Franklin W. Olin College of Engineering",
         "institution":"Massachusetts Institute of Technology",
         "field":"Engineering - Systems Engineering",
         "id":"20090269",
         "grad":"Massachusetts Institute of Technology",
         "email":"Email Address Not Available for Publication"},
        {
         "name":"Malecha, Gregory M",
         "undergrad":"William Marsh Rice University",
         "institution":"Harvard University School of Engineering and Applied Sciences",
         "field":"Comp/IS/Eng - Computer Science - Languages and Systems",
         "id":"20090693",
         "grad":"Harvard University",
         "email":"Email Address Not Available for Publication"}]
 }}
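Blacklight will use the _s string fields for faceting in the next section; faceting can also be exercised directly in Solr, as in this sketch counting award recipients per institution:

$ curl 'http://localhost:8983/solr/select/?q=*:*&rows=0&facet=true&facet.field=institution_s&wt=json&indent=on'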
6. PROVIDING A COMPLETE WEB USER INTERFACE WITH BLACKLIGHT

Solr’s default responses are fine if users want to get raw XML or JSON, but at this point most institutions would likely wish to build a full user interface in front of Solr.

6.1 Installing dependencies for Blacklight

The commands below install the Ruby interpreter, git (which will retrieve Blacklight files), and the gem tool (which manages third-party packages for Ruby).

$ sudo apt-get install ruby-full git-core libsqlite3-ruby
$ wget http://rubyforge.org/frs/download.php/70696/rubygems-1.3.7.tgz
$ tar -xvzf rubygems-1.3.7.tgz
$ cd rubygems-1.3.7
$ sudo ruby setup.rb
$ sudo ln -s /usr/bin/gem1.8 /usr/bin/gem
$ cd
$ gem sources -a http://gems.rubyforge.org
$ gem sources -a http://gems.rubyonrails.org
$ gem sources -a http://gems.github.com
$ gem sources -a http://gemcutter.org
With gem installed, the following command installs Rails, i.e. Ruby on Rails.

$ sudo gem install -v=2.3.4 rails    # ignore rdoc errors
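A version check confirms that the pinned release installed:

$ rails -v
Rails 2.3.4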
6.2 Installing Blacklight

Blacklight provides a template installer which sets up a new project, in this case in the blacklight-app directory.

$ rails ./blacklight-app -m \
    http://github.com/projectblacklight/blacklight/raw/v2.5.0/template.rb
...
Would you like to install the gem dependecies now? yes
Do you want to install gems using sudo? yes
Would you like to run the initial database migrations now? yes
...
Would you like to install and configure Apache Solr now? no
...
$
6.3 Running Blacklight

The solrconfig.xml in Solr’s example does not work with Blacklight. XML along the lines below, replacing the contents of solr/conf/solrconfig.xml, works with the schema above: the search handler provides dismax queries for Blacklight searches, and the document handler retrieves a single record by ID.

<config>
  <abortOnConfigurationError>${solr.abortOnConfigurationError:true}</abortOnConfigurationError>
  <dataDir>${solr.data.dir:./solr/data}</dataDir>

  <requestHandler name="/update" class="solr.XmlUpdateRequestHandler" />

  <requestHandler name="standard" class="solr.SearchHandler" default="true">
    <lst name="defaults">
      <str name="echoParams">explicit</str>
      <str name="q.alt">*:*</str>
    </lst>
  </requestHandler>

  <requestHandler name="search" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="defType">dismax</str>
      <str name="echoParams">explicit</str>
      <str name="q.alt">*:*</str>
      <str name="qf">text</str>
      <str name="fl">*,score</str>
      <str name="facet">true</str>
      <str name="facet.mincount">1</str>
    </lst>
  </requestHandler>

  <requestHandler name="document" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="echoParams">explicit</str>
      <str name="fl">*</str>
      <str name="rows">1</str>
      <str name="q">{!raw f=id v=$id}</str>
    </lst>
  </requestHandler>

  <admin>
    <defaultQuery>engineering</defaultQuery>
  </admin>
</config>
A restart of the Solr server will load the new configuration.

$ java -jar start.jar    # generates a lot of output
...
... Started SocketConnector @ 0.0.0.0:8983
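With the new configuration loaded, the search handler can be exercised directly before wiring up Blacklight; this check assumes the dismax handler named search above:

$ curl 'http://localhost:8983/solr/select/?qt=search&q=engineering&wt=json&indent=on'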
Rails provides a script to start the Blacklight server. From the blacklight-app directory:

$ ./script/server    # generates a lot of output
...
... INFO  WEBrick::HTTPServer#start: pid=13378 port=3000
At this point, the Rails server is running the Blacklight application, but no information will display until the Blacklight configuration matches the Solr schema and configuration. In blacklight-app/config/initializers/blacklight_config.rb:

Blacklight.configure(:shared) do |config|

  # default params for the SolrDocument.find_by_id method
  SolrDocument.default_params[:find_by_id] = {:qt => :document}

  config[:default_qt] = "search"

  # solr field values given special treatment in the show (single result) view
  config[:show] = {
    :html_title => "name",
    :heading => "name",
    :display_type => "format"
  }

  # solr field values given special treatment in the index (search results) view
  config[:index] = {
    :show_link => "name",
    :num_per_page => 10,
    :record_display_type => "format"
  }

  # solr fields that will be treated as facets by the blacklight application
  config[:facet] = {
    :field_names => [
      "institution_s",
      "grad_s",
      "undergrad_s",
    ],
    :labels => {
      "institution_s" => "Institution",
      "grad_s" => "Grad",
      "undergrad_s" => "Undergrad",
    },
    # Setting a limit will trigger Blacklight's 'more' facet values link.
    # If left unset, then all facet values returned by solr will be displayed.
    # nil key can be used for a default limit applying to all facets otherwise
    # unspecified.
    :limits => {
      nil => 10
    }
  }

  # solr fields to be displayed in the index (search results) view
  config[:index_fields] = {
    :field_names => [
      "field",
      "institution",
    ],
    :labels => {
      "field" => "Field:",
      "institution" => "Institution:",
    }
  }

  # solr fields to be displayed in the show (single result) view
  config[:show_fields] = {
    :field_names => [
      "field",
      "institution",
      "grad",
      "undergrad",
      "email"
    ],
    :labels => {
      "field" => "Field:",
      "institution" => "Institution:",
      "grad" => "Grad:",
      "undergrad" => "Undergrad:",
      "email" => "Email:"
    }
  }

  config[:search_fields] ||= []
  config[:search_fields] << {:display_label => 'All Fields', :qt => 'search'}

  config[:sort_fields] ||= []
  config[:sort_fields] << ['relevance', 'score desc']  # sort by relevance by default

end