My friend and colleague Mark recently mentioned that one of his clients wanted a search engine on their website, and asked whether I had any ideas.
The scenario was this: the site is an informational site, updated monthly and hosted in the AWS cloud. It runs on a minimal Linux instance with only 1 GB of RAM and very tight storage. It’s not an e-commerce site. Was there something small and lean enough to serve?
Mark and I had once worked together on a project for a different client where we installed Apache Solr to build a sophisticated search engine for large amounts of data, but Solr would be massive overkill for the site in question.
GNU Grep to the Rescue
As I thought about a solution for this small site, I immediately thought of grep, the open-source search utility with a long Unix heritage that can absolutely rip through text files to search for words or phrases and show them in context. All it needs are some text files to aim at.
The site in question has a large number of PDF files and HTML files. What, I thought, if copies of these were converted to plain text files and placed in a data directory where grep could rapidly search through them? Text files could substitute for the usual inverted index of search engines and, at the same time, have a much smaller footprint on the system. The client wasn’t looking for fancy searches.
Similarly, grep doesn’t need much memory to run. Furthermore, a lightweight website search engine based on grep could be built with a few days’ programming and testing. After getting the go-ahead to start programming, I invoked vim and began building a simple system.
Building the Text Database, or Index
I knew I’d use Pandoc to convert HTML files to plain text, but I needed something to convert the PDFs. I discovered the command-line utility pdftotext, which is part of xpdf-tools in Linux. (For macOS, Homebrew installs the utility when you install xpdf.) Between these two, pandoc and pdftotext, I had the tools for building a text database.
To that end, I wrote a batch-processing script in Perl, buildindex.pl, which takes the output of a find command that selects all the PDF and HTML files on the site and processes them through pandoc or pdftotext, putting the resulting text files into a collective data directory called textdata. The script also checks an exclude.txt file that can be used to exclude directories containing private information.
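The shape of that batch job can be sketched in a few lines of shell. The function and directory names below (encode_name, demo-site) are illustrative, not lifted from buildindex.pl, and the sketch uses whole-second mtimes for simplicity:

```shell
#!/bin/sh
# Sketch of the batch-indexing idea: convert every HTML and PDF document
# to plain text, saving each under a flattened, timestamp-prefixed name.

TEXTDATA=textdata
mkdir -p "$TEXTDATA" demo-site
printf '<p>hello trains</p>\n' > demo-site/page.html   # tiny demo input

# Flatten a relative path into one filename: prefix the mtime in epoch
# seconds, spell a leading "." as "dot", turn each "/" into "_99_", and
# append ".txt".
encode_name() {
    mtime=$(stat -c %Y "$1" 2>/dev/null || stat -f %m "$1")  # GNU, then BSD
    printf '%s%s.txt' "$mtime" "$(printf '%s' "$1" | sed -e 's|^\.|dot|' -e 's|/|_99_|g')"
}

# Convert each document into the text database, skipping quietly if a
# converter is not installed on this machine.
find demo-site -type f \( -name '*.html' -o -name '*.pdf' \) |
while read -r f; do
    out="$TEXTDATA/$(encode_name "./$f")"
    case "$f" in
        *.html) command -v pandoc    >/dev/null && pandoc -t plain "$f" -o "$out" ;;
        *.pdf)  command -v pdftotext >/dev/null && pdftotext "$f" "$out" ;;
    esac || :
done
```

Because the conversion step is just two external commands, adding another document type later would mean adding one more case branch.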
Embedded Filename Metadata
GNU Grep is not a full-featured search engine, but with a little help from the GNU ls command I was able to prepend each file’s last-modified time (its mtime, as a Unix timestamp) to the saved filename, so that search results could later be sorted with the most recently updated documents at the top.
The batch script populates the textdata directory with files that look like this:
Breaking this down, the initial part of the saved text filename — 1645460076dot2987186510 — is the file’s last-modified time (mtime) as a Unix timestamp.
The word dot indicates an initial dot (.) in the relative pathname, and every _99_ represents a forward slash (/) in the original pathname. The added new file extension is .txt.
This metadata allows the search program to quickly reconstruct the path back to the original document, and to replace .txt with the original extension name.
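That reconstruction can be sketched as a one-line shell function. The decode_name function and the sample filename here are illustrative, not taken from search.php:

```shell
# Hypothetical inverse of the indexer's filename encoding: strip the
# epoch-seconds prefix, restore the leading "." and the slashes, and
# drop the trailing ".txt" that the indexer appended.
decode_name() {
    printf '%s\n' "$1" |
        sed -E -e 's|^[0-9]+||' -e 's|^dot|.|' -e 's|_99_|/|g' -e 's|\.txt$||'
}

decode_name "1645460076dot_99_docs_99_report.pdf.txt"   # → ./docs/report.pdf
```

Since the original extension (here .pdf) survives inside the flattened name, stripping the added .txt is all that is needed to recover it.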
The Search Module
The client’s website is powered by PHP, so that is the language I used for the search module.
A search form module, searchform.php, prompts for a search term or phrase, which is then passed to the main search program, search.php. The search.php script, in turn, calls on grep to do the search and stores the results in an array that is then reverse sorted. Looping through the array, the search script reconstructs the full path and original extension of the filename, turning it into an <a href> HTML link.
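The reason a plain reverse sort is enough comes from the filename scheme: every saved name begins with its mtime in epoch seconds, so reverse lexical order is newest-first (the timestamps all have the same number of digits until the year 2286). A minimal sketch, with an illustrative directory and sample files rather than real site data:

```shell
#!/bin/sh
# Build two tiny index files whose names begin with different mtimes.
mkdir -p textdata
printf 'night train\n' > textdata/1600000000dot_99_old.html.txt
printf 'train song\n'  > textdata/1700000000dot_99_new.html.txt

# grep -l lists the matching files; sort -r puts the most recently
# updated document first, thanks to the timestamp prefix.
grep -r -i -l -- 'train' textdata/ | sort -r
```

search.php does the equivalent in PHP, reverse sorting the match array before building the links.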
To make the results easier to read, search terms found in the results are highlighted in red so they stand out in context. Overall appearance is controlled in HTML with an embedded CSS style sheet.
The results, reflecting the song lyrics in my test site, look like this:
Context and Word Boundaries
To refine the search somewhat, the searchform.php file offers two checkboxes. The first lets the searcher choose between whole-word (or whole-phrase) searching and stem searching. In a whole-word search, the default setting, the word “train” would find instances of “train”, “Train”, or “TRAIN” as a whole word surrounded by spaces or punctuation. A stem search on “train” would find those as well, but also things like “trains,” “training,” and “restrain.” This is sometimes useful as an option.
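In grep terms, the difference is one flag. The flags below mirror what a search script might pass; I’m not claiming they are search.php’s exact invocation:

```shell
#!/bin/sh
# Sample "lyrics" to search (illustrative data).
mkdir -p demo
printf 'The night train\nAll that training\nrestrain yourself\n' > demo/lyrics.txt

# Whole-word search: -w matches "train" only when bounded by non-word
# characters; -i makes it case-insensitive. Matches only the first line.
grep -i -w 'train' demo/lyrics.txt

# Stem search: dropping -w also matches "training" and "restrain",
# so all three lines match.
grep -i 'train' demo/lyrics.txt
```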
The second checkbox specifies the amount of context surrounding the search term. The default is up to 90 characters on either side of the term. Unchecking the box results in a context of three lines of text: the line before the search term is found, the line it’s in, and the line following.
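Both context styles map naturally onto grep. Again this is a sketch of the approach, not necessarily the exact flags search.php uses:

```shell
#!/bin/sh
# A three-line sample file (illustrative data).
mkdir -p demo
printf 'line before\nthe midnight train home\nline after\n' > demo/song.txt

# Character context: -o prints only the match, and the regex pads the
# term with up to 90 characters on either side.
grep -i -o -E '.{0,90}train.{0,90}' demo/song.txt

# Line context: one line before (-B1) and one after (-A1) each hit.
grep -i -B1 -A1 'train' demo/song.txt
```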
To keep the search index, or text data directory, in sync with the information on the site, the buildindex.pl script uses brute force. It deletes everything in textdata/ and rebuilds it from scratch. What this lacks in sophistication it makes up for in efficiency. It takes no more than five minutes to rebuild the index for the entire site, and the script can be run manually when needed, or as a cron job at desired times.
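For the cron route, a crontab entry along these lines would rebuild the index nightly (the script path here is hypothetical):

```
# min hr dom mon dow  command — rebuild the text index at 02:30 every night
30 2 * * * perl /var/www/tools/buildindex.pl
```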
To our delight, this lightweight, batch-oriented search engine is speedy, and is well suited to the needs of the client. In honour of grep, we named the search system Greppy.
Greppy follows the Unix philosophy of combining existing discrete utilities to process text files. There is no need to reinvent the wheel.
For others who might have a use for it, the engine is available here on GitHub.
Gene Wilburn is a Canadian IT specialist and technical writer