Search your Dropbox Files with Solr and Node

Mike Lenz · June 27, 2013 · @Galler.io
Looking for that apricot jam recipe you lost in your Dropbox folder months ago? Not to worry, a robust search engine for your files can be built quite easily. For this project we'll use Node.js, the Solr search platform, and the Dropbox APIs. Features include full content search of multiple document types, real-time index updates, some handy search shortcuts, and a search results page with snippets and ajax navigation.
Here's a screenshot of search in action:

[Screenshot: the search results page with snippets]
Some examples of search shortcuts:
    recipe                 Search for text (case and stemming aware)
    "cake recipe"          Phrase search
    jam recipe in:Files    Match documents within a folder
    when:yesterday         All documents modified yesterday
    recipe when:2012       Matches from year 2012
    by:Mike                Match the given author
    dogs type:image        Return only images
    where:40"47'           Match lat/long in image metadata
The full code discussed below is available in the Dropbox Search project on GitHub. That page has full details on setup and environment variables which are not covered here.

First, set up Solr

Installing Solr is outside the scope of this document, but in short, follow the tutorial for Solr 3.6.2. (Solr 4 is also available but not yet tested with this project.) Before running the server, you should make a few additions to the solr/conf/schema.xml file in your solr/example directory. These changes are detailed in the GitHub project in solr/schema.xml and enable two features: storage (not just indexing) of file contents, to support snippet highlighting; and definitions of new fields, such as path, mime_type, and rev, which are used by our search schema. Finally, run Solr and you should have an active instance on the default port at http://localhost:8983/solr.
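
For illustration, the additions look roughly like this (a sketch only; the field names come from the project, but the types shown are assumptions, so treat solr/schema.xml in the repository as authoritative):

<!-- sketch of schema.xml additions; field types are assumptions -->
<!-- store file contents (not just index them) so Solr can return highlighted snippets -->
<field name="text" type="text_general" indexed="true" stored="true"/>
<!-- custom fields used by our search schema -->
<field name="path" type="string" indexed="true" stored="true"/>
<field name="mime_type" type="string" indexed="true" stored="true"/>
<field name="rev" type="string" indexed="true" stored="true"/>
<field name="when" type="date" indexed="true" stored="true"/>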

Next, index your files

The code in indexer.js runs as a daemon process and does the work of adding your files to your Solr instance. Here are the steps involved.
  1. Listen for delta events. The Dropbox delta API provides a list of changes (additions, edits and deletions) to your files in the cloud since a certain event, or since the beginning of time. This allows us to initialize our index with a stream of events, each of which corresponds to a file to be added, modified, or removed from the index. It also means we can keep our indexer running in the background and as you add new files to your Dropbox, they will automatically be indexed within minutes. Here's a snippet:
    dboxClient.delta({cursor : deltaCursor}, function(status, reply) {
        if (status != 200) {
            // error handling
            return;
        }
        // remember where we are for the next call
        deltaCursor = reply.cursor;
        // ... process reply.entries, as shown below
    });

    Each delta entry is a [file_path, metadata] tuple. If the metadata is absent, the file was deleted; otherwise it's a new or edited file. I don't process additions and edits immediately; instead, I place them in a queue and drain it at a rate of one per second (a timer sketch appears at the end of this step). This keeps us from fetching a large number of files from Dropbox at once, which can trigger rate-limiting 503 responses.

        for (var i = 0; i < reply.entries.length; i++) {
            var entry = reply.entries[i];
            var path = entry[0];
            var metadata = entry[1];
            if (!metadata) {
                deleteFile(path);
            }
            else if (!metadata.is_dir) {
                fileQueue.push(entry);
            }
        }
    

    Note: For details on working with OAuth to enable access to the API, see my authentication post.
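
    Here's a minimal sketch of the queue timer, assuming an indexFile helper that performs the download-and-extract step shown next:

        // drain the queue at one file per second to avoid 503s;
        // indexFile is a stand-in for the get/extract code in step 2
        setInterval(function() {
            var entry = fileQueue.shift();
            if (entry) {
                indexFile(entry[0], entry[1]);
            }
        }, 1000);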

  2. Add each file to Solr. In a separate loop, we pop entries from the queue, download the file contents from Dropbox, then pass the contents and metadata to Solr for indexing:
    dboxClient.get(path, function(status, contents, metadata) {
        var query = querystring.stringify({
            "literal.id": path,
            "literal.rev": metadata.rev,
            "literal.when": dateFormat(metadata.client_mtime, "isoDateTime") + "Z",
            // ... various other fields here
        });
        var options = {
            method: "POST",
            path: "/solr/update/extract?" + query,
            data: contents
        };
        solrClient.request(options, function(err) {
            // ...
        });
    });

    We call the Dropbox get method to fetch the contents and metadata of each file, then pass that data to Solr's extract handler. That handler uses Apache Tika to parse the contents of many popular file formats, including PDF, HTML, RTF, ODF, Office documents (doc, xls, ppt), and more. For binary files, Tika extracts common metadata, such as the GPS coordinates of images.

    Each document is added to the index with its file path as a unique identifier. We also write additional custom fields such as the Dropbox revision tag, last-modified date, type, size, as well as metadata fields extracted by Tika and specified in our schema.xml, such as author and title.

  3. Repeat. As long as the indexer is running, it will capture ongoing delta events on a timer and add, modify, or remove documents as needed. Note that for a modified file, the id (path) is unchanged, so Solr treats the update as a rewrite of the existing document. As an optimization, we can also query for a document matching "id:<path> AND rev:<rev>" for each path revision returned by the delta call, and skip indexing in case of a match; this lets us re-run the indexer from time zero without the overhead of downloading every Dropbox file again. A sketch of that check follows.
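
    A minimal sketch of the revision check, assuming Node's built-in http module and the default Solr select handler (isAlreadyIndexed is a hypothetical helper, not part of the project):

        var http = require("http");

        // ask Solr whether this exact path + revision is already indexed;
        // if so, the caller can skip the Dropbox download entirely
        function isAlreadyIndexed(path, rev, callback) {
            var q = encodeURIComponent('id:"' + path + '" AND rev:"' + rev + '"');
            http.get("http://localhost:8983/solr/select?wt=json&rows=0&q=" + q, function(res) {
                var body = "";
                res.on("data", function(chunk) { body += chunk; });
                res.on("end", function() {
                    callback(null, JSON.parse(body).response.numFound > 0);
                });
            }).on("error", callback);
        }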

Finally, build a search page

The search page is driven by a simple Node server (server.js) that takes search queries from a web page, submits them to our Solr instance, and returns JSON results to the page. A query for text provided by the user looks like this:
var solr = require("solr");

var start = query.start || 0;
var options = { "fl" : "title,id,path,when,icon,size,mime_type",
    "defType" : "edismax",  // the ExtendedDisMax query parser
    "qf" : "title text",    // match against both title and body text
    "hl" : "on",            // highlight matches for snippets
    "hl.fl" : "*",
    "start" : start };
solr.createClient().query(text, options, function(err, reply) {
    // error handling omitted; reply is the raw JSON response from Solr
    res.write(reply);
    res.end();
});

We use the ExtendedDisMax query parser, specify the fields to return along with various other options, and submit the query to Solr. As another useful feature, we pre-process custom fields such as "when:" in user queries to provide the shortcuts listed earlier, for example:

// e.g. when:yesterday or when:2013-01 or when:2012-10-08
text = text.replace(/\bwhen:yesterday\b/g, "when:[NOW-1DAY/DAY TO NOW/DAY]");
// use Solr date math for month ranges so short months (e.g. February) work
text = text.replace(/\bwhen:(\d\d\d\d-\d\d)\b/g, "when:[$1-01T00:00:00Z TO $1-01T00:00:00Z+1MONTH-1SECOND]");
text = text.replace(/\bwhen:(\d\d\d\d-\d\d-\d\d)\b/g, "when:[$1T00:00:00Z TO $1T23:59:59Z]");
// bare years (e.g. when:2012) must run last so they don't clobber the rules above
text = text.replace(/\bwhen:(\d\d\d\d)\b/g, "when:[$1-01-01T00:00:00Z TO $1-12-31T23:59:59Z]");

Finally, our client-side JavaScript handles converting JSON results to HTML, ajax paging and scrolling of results in regular increments, highlighting matching terms in text snippets, and back/forward navigation using HTML5 pushState.
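
For instance, the back/forward handling can be a few lines (a sketch; runSearch is a hypothetical helper that performs the ajax request and renders the results):

// push a history entry for each new query, and replay the search
// when the user navigates back or forward
function navigate(q) {
    history.pushState({q: q}, "", "?q=" + encodeURIComponent(q));
    runSearch(q); // hypothetical: fetch JSON from server.js and render it
}
window.onpopstate = function(event) {
    if (event.state) runSearch(event.state.q);
};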

Thanks for reading, and feel free to contribute to the project on GitHub. Contact me on Twitter with any comments or questions.