A Recoll Indexing Filter for Lotus Notes Databases
This filter application provides a mechanism for extracting and indexing all of the documents contained in a Lotus Notes database with the Recoll text search tool on a Linux system.
Requirements:
You must have the Linux version of the IBM Lotus Notes desktop client installed on the computer where you install this filter. The filter uses the Java API that is installed as part of the client in order to access the database files.
You must also, of course, have Jean-Francois Dockes' Recoll text search tool installed.
Installation:
If you are updating from a previous release, only the first two steps below are required.
- The filter is packaged as a ZIP file. Download and unpack this ZIP file to a temporary directory on your computer.
- Copy these files to the Recoll filters directory, usually /usr/share/recoll/filters:
rcllnotes.jar
rcllnotes
rclOpenNotesClient
- Add the following line to the mimemap file in your Recoll configuration directory, usually ~/.recoll/mimemap:
.nsf = application/x-extension-nsf
See the examples/mimemap file contained in the ZIP file for an example.
- Add the following line to the [index] section of the mimeconf file in your Recoll configuration directory, usually ~/.recoll/mimeconf:
application/x-extension-nsf = execm rcllnotes
See the examples/mimeconf file contained in the ZIP file for an example.
- Add the following line to the [stored] section of the fields file in your Recoll configuration directory, usually ~/.recoll/fields:
See the examples/fields file contained in the ZIP file for an example.
- Add the following line to the top of the mimeview file in your Recoll configuration directory, usually ~/.recoll/mimeview. It goes above the [view] section, if there is one:
xallexcepts = application/pdf application/postscript application/x-dvi text/html|gnuinfo text/html|chm text/html|epub text/html|notesdoc
If your mimeview file already contains this line, please just add the text/html|notesdoc string to the end.
Add the following line to the [view] section of the mimeview file:
text/html|notesdoc = /usr/share/recoll/filters/rclOpenNotesClient %f
If you do not have Recoll installed in the default location, please adjust the path to the rclOpenNotesClient script above accordingly.
See the examples/mimeview file contained in the ZIP file for examples.
- Copy the examples/.rcllnotes configuration file to your home directory as ~/.rcllnotes. Edit that file and put your Lotus Notes password into it at the location specified in the file. You may also enable the other settings in this file if you wish. They are all optional, and in most cases the filter functions properly without them. Almost all of those optional settings will cause the indexing process to run more slowly than it normally would.
The installation is now complete. When recollindex runs it should index all of the Lotus Notes databases on your computer. Since these databases can be very large and may contain thousands of documents, do not be surprised if recollindex runs for much longer than you have previously been used to.
Remember that you can control which files are indexed through the preference settings in the Recoll GUI or by directly editing your Recoll config file.
Also, if you run recollindex interactively and watch the console output, do not be surprised if you see long pauses in recollindex processing while it is working on a Lotus Notes database. The rcllnotes filter uses a Java application to extract all of the documents from the database as part of the process and you will not see any console output from recollindex while that is happening. For (very) large databases this pause can be (very) lengthy.
The ~/.rcllnotes configuration file contains a debug log setting that you can enable to instruct the Java application to output detailed information about its activity to a log file. You can monitor that log file during processing to see what the Java application is doing if you are concerned. This log can also be useful in troubleshooting problems.
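As a quick sanity check after installation, the following Python sketch looks for the two mapping lines from the steps above in a Recoll configuration directory. It is not part of the filter package: the function name check_recoll_config and the whitespace-insensitive comparison are assumptions made for illustration, and it only covers the mimemap and mimeconf entries.

```python
from pathlib import Path

# Lines the installation steps ask you to add, keyed by config file name.
REQUIRED = {
    "mimemap": ".nsf = application/x-extension-nsf",
    "mimeconf": "application/x-extension-nsf = execm rcllnotes",
}


def _normalize(line):
    """Drop all whitespace so spacing differences do not matter."""
    return "".join(line.split())


def check_recoll_config(config_dir):
    """Return the names of the files that lack their required line.

    Hypothetical helper: a missing or unreadable file counts as lacking it.
    """
    missing = []
    for name, wanted in REQUIRED.items():
        path = Path(config_dir) / name
        try:
            text = path.read_text(encoding="utf-8", errors="replace")
        except OSError:
            missing.append(name)
            continue
        lines = text.splitlines()
        if not any(_normalize(l) == _normalize(wanted) for l in lines):
            missing.append(name)
    return missing


if __name__ == "__main__":
    problems = check_recoll_config(Path.home() / ".recoll")
    print("missing entries in:", ", ".join(problems) if problems else "none")
```

If your Recoll configuration lives somewhere other than ~/.recoll, pass that directory to check_recoll_config instead.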
Features, Quirks, and FAQs:
- What is indexed? - Notes documents and their attachments are converted into HTML format by this filter, and with the help of the rest of the Recoll filters the resulting text is indexed by recollindex. This filter ignores graphics that are embedded in Notes documents, as there doesn’t seem to be much meaningful metadata associated with them. It does, however, process graphic images that are attachments, since they might contain interesting metadata tags.
- The Preview and Open links in search results lists - The Open link works. The Preview link does not, most of the time. Indexed attachments are opened in their respective applications. Notes documents are opened in the Notes client.
The Preview link will probably never work reliably with Notes documents or attachments. The Recoll application compares the time a document was indexed with the current modification time of the Notes database file it came from, and refuses to open the preview if the database file is newer. Given the dynamic nature of Notes database files and the many actions that can update them, including just reading from them, it’s unlikely that these two timestamps will match, and therefore unlikely that the Preview link will work.
- The indexing process takes a long time - Yes it does. The process of extracting documents from a Notes database is slow work. There can be thousands of documents and attachments in a database, even tens of thousands. The recollindex application, and therefore this filter, runs with the lowest possible priority on the system in order to have as little impact as possible on the other work you are doing. All of these factors contribute to the length of time it can take to index a Lotus Notes database. I have a mail archive database which contains just under 10,000 documents and attachment files. It takes 15 minutes to extract and index this file and that's on a system with an 8-way i7 Intel CPU and 32GB of RAM. Have patience.
Consider setting up a separate index and indexing run for Notes databases. See Configurations, multiple indexes in the Recoll manual for guidance on that. I have one index for my normal files, a second index for my active Notes databases, and a third index for my inactive archive databases. Each index is updated by recollindex on a different schedule specified in my crontab file. My Recoll search GUI is configured to search all three of these indexes as if they were one.
The only potential “gotcha” with this configuration arises when two instances of recollindex run at the same time and both try to open the same Notes database. When that happens there will likely be a deadlock, with both indexing processes hanging indefinitely. I've seen it happen. The trick is to be sure that the topdirs and skippedPaths settings in your respective recoll.conf files prevent any overlap in the directories and files processed by each of the instances you are setting up.
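The overlap check described above can be sketched in Python. This is only an illustration under stated assumptions: the function names and the example paths are hypothetical, you copy the topdirs and skippedPaths values out of each recoll.conf by hand, and skippedPaths entries are treated as plain directories (wildcard patterns are not handled).

```python
import os


def _is_within(path, parent):
    """True if path equals parent or lies somewhere beneath it."""
    path = os.path.abspath(path)
    parent = os.path.abspath(parent)
    return os.path.commonpath([path, parent]) == parent


def find_overlaps(indexes):
    """indexes maps an index name to (topdirs, skipped_paths).

    Returns (index_a, dir_a, index_b, dir_b) tuples where dir_a lies inside
    dir_b and is not excluded by index_b's skipped paths. Identical topdirs
    in two indexes are reported once in each direction.
    """
    overlaps = []
    for name_a, (tops_a, _) in indexes.items():
        for name_b, (tops_b, skipped_b) in indexes.items():
            if name_a == name_b:
                continue
            for dir_a in tops_a:
                for dir_b in tops_b:
                    if not _is_within(dir_a, dir_b):
                        continue
                    if any(_is_within(dir_a, s) for s in skipped_b):
                        continue
                    overlaps.append((name_a, dir_a, name_b, dir_b))
    return overlaps


# Hypothetical three-index layout: normal files, active and archived databases.
layout = {
    "files": (["/home/me"], ["/home/me/notes"]),
    "active": (["/home/me/notes/data"], ["/home/me/notes/data/archive"]),
    "archive": (["/home/me/notes/data/archive"], []),
}
print(find_overlaps(layout))
```

An empty result means no index's top directory is reachable from another index. Removing a skipped path from the example makes the corresponding pair show up in the list.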
- Performance impacts of multi-threading in Recoll 1.19 - In version 1.19 Recoll became multi-threaded. This means that Recoll will invoke multiple instances of the rcllnotes filter simultaneously. Those multiple instances can chew up a lot of CPU and memory and can slow down your system. If this happens to you, you will want to read about the configuration options that allow you to control the number of "file conversion and data extraction" threads that are spawned by Recoll. See the details in the Recoll manual.
The value that needs to be adjusted is the first number in the “thrQSizes” and “thrTCounts” strings. You will need to experiment to find the right values for your particular system; there is no single “correct” answer.
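For experimenting with those settings, a small Python helper can rewrite the first number of either string in a copy of your configuration text. This is a sketch only: the function name set_first_stage is hypothetical, and the numbers in the sample are illustrative values, not recommendations.

```python
def set_first_stage(conf_text, name, value):
    """Return conf_text with the first number of the `name` setting replaced.

    Lines that do not define `name` are passed through unchanged.
    """
    out = []
    for line in conf_text.splitlines():
        key, sep, rest = line.partition("=")
        if sep and key.strip() == name:
            numbers = rest.split()
            if numbers:
                numbers[0] = str(value)
                line = f"{name} = {' '.join(numbers)}"
        out.append(line)
    return "\n".join(out)


# Illustrative values only; tune them for your own system.
sample = "thrQSizes = 2 2 2\nthrTCounts = 4 2 1"
limited = set_first_stage(sample, "thrTCounts", 1)
print(limited)
```

Check the Recoll manual for the meaning of each position and of any special values before writing a changed line back to recoll.conf.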
Release History:
1.0.0 (2012-12-11)
1.1.0 (2012-30-11)
- Updated to accept additional title field types
- Fixed minor typo in Installation section of README
1.2.0 (2013-08-01)
- Updated for Notes 9
- Updated the RecollFilter Java application that extracts documents from databases to use a new thread model documented with Notes 9. This is supposed to be backward compatible with Notes 8.5 but I am not able to test that.
- Added "-Xmx1g" JVM option in rcllnotes filter script to increase maximum heap size to 1GB in order to eliminate exceptions when processing larger Notes DBs.
How The Filter Works:
- Assuming that the various ~/.recoll/mime* files are set up properly (see the Installation section above), when recollindex encounters a Lotus Notes database (*.nsf) file it invokes the rcllnotes filter to open the file.
- The filter invokes a Java application that opens the database file and reads every document in the database in turn, extracting it in XML format. During the extraction process the document's XML is transformed into HTML using an XSLT stylesheet. The Java application then walks through this HTML, extracting any attachment files from the HTML. It then pipes these attachment files back to the rcllnotes filter as individual files in base64 text format. After handling all of the attachments, the Java application pipes the remaining Notes document HTML back to the rcllnotes filter. It repeats this process until all of the database file’s documents have been extracted.
- The rcllnotes filter takes the stream of data that is piped to it by the Java application and parses it into individual files. The filter pipes each of these files back to recollindex for indexing. The attachment files that are part of the stream are converted from their base64 format into their original binary format before they are piped to recollindex. Recollindex may submit these attachment files to other filters for further processing, depending on their mime type. Recollindex ultimately inserts the contents of the Notes documents and the attachments into its index, where they are available for searching.
- The rcllnotes filter is also invoked by the recoll GUI application when you elect to open an attachment or Notes document that has been presented to you as part of a search result. In this case the filter is given both the *.nsf file’s name and the Lotus Notes document UNID and asked to retrieve the document. It does so, fetching that individual document and subjecting it to the same XML/XSLT conversion and attachment processing described above before returning the results to the recoll application. If the file you selected is an attachment, recoll will open it using its normal process for that type of file. If it is a Notes document, the recoll application passes the HTML representation of the document that the filter produced on to the command specified for Notes documents in the mimeview file, the rclOpenNotesClient script. That Python script extracts the NotesURL for the document from inside the HTML file and invokes the Notes client, passing the NotesURL to it on the command line. The client then starts up, if it isn’t already running, and opens that document. If Notes is already running, the requested document is simply opened in the existing client window.
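The parse-and-decode step in the list above can be sketched in Python. The record framing used here (a header line starting with "--- " followed by a kind and a file name) is invented purely for illustration and is not the format the Java application actually emits; only the base64-to-binary conversion of attachments and the attachments-before-document order are taken from the description.

```python
import base64

MARKER = "--- "  # hypothetical record header prefix, not the filter's real format


def parse_stream(text):
    """Split piped text into (name, kind, payload_bytes) records."""
    raw = []  # each entry: [kind, name, list_of_body_lines]
    for line in text.splitlines():
        if line.startswith(MARKER):
            kind, name = line[len(MARKER):].split(" ", 1)
            raw.append([kind, name, []])
        elif raw:
            raw[-1][2].append(line)

    records = []
    for kind, name, lines in raw:
        joined = "\n".join(lines)
        if kind == "attachment":
            # Attachments travel as base64 text; restore the original binary.
            payload = base64.b64decode(joined)
        else:
            # Document HTML is passed through as text.
            payload = joined.encode("utf-8")
        records.append((name, kind, payload))
    return records


# One attachment followed by the document it belonged to.
stream = (
    "--- attachment report.bin\n"
    + base64.b64encode(b"\x00\x01binary").decode("ascii")
    + "\n--- document memo.html\n<html><body>Hello</body></html>\n"
)
for name, kind, payload in parse_stream(stream):
    print(name, kind, len(payload))
```

Each record would then be handed to recollindex as a separate file; how that hand-off is framed is outside the scope of this sketch.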
Support:
There is no formal support for this code. I will provide "best-effort" support insofar as my real life allows. I use the discussion forum and the bug tracking system here on SourceForge for these purposes. Please make use of them if you have problems or questions.
Licensing:
This code is licensed under the terms of the GNU GPL v3 license.