The WhelanLabs Search Engine is a graphical front end to a preconfigured installation of Apache Nutch. Its main purpose is to give an organization a simple, Windows-based search engine for web-based resources that are inaccessible to traditional search engines.
The WhelanLabs Search Engine is freely available from the following sites:
Install and Run in under 10 Minutes
[Note: For those in China, the same video can be seen at http://www.56.com/u81/v_NzI0ODYxMTA.html ]
[Note: In addition to the file types mentioned above, support for further file types can be added via the underlying Apache Tika framework. See Apache Tika for more information.]
No maximum limits are known to have been published for Nutch. Anecdotal evidence suggests the existence of systems with 100-200 million documents. Sizing is mainly a question of available resources.
Current data suggests that, on average, the crawler takes about 4 hours to process each 1 million URLs, and the index for each 1 million URLs takes about 1.2 GB of space. [Note: These numbers will be updated based on additional reports from the field. To report your results, please post to The WhelanLabs SearchEngine Forum.]
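As a rough illustration of the averages above, the following sketch scales them linearly to a given number of URLs. The function name and the linear-scaling assumption are mine and are not part of the application; real crawl times and index sizes will vary with hardware, network, and content.

```python
# Averages quoted above; treat them as rough field estimates, not guarantees.
HOURS_PER_MILLION_URLS = 4.0
GB_PER_MILLION_URLS = 1.2


def estimate_crawl(url_count):
    """Return (crawl_hours, index_gb) for url_count URLs, assuming linear scaling."""
    millions = url_count / 1_000_000
    return millions * HOURS_PER_MILLION_URLS, millions * GB_PER_MILLION_URLS


hours, gigabytes = estimate_crawl(25_000_000)
print(f"Estimated crawl time: {hours:.1f} hours, index size: {gigabytes:.1f} GB")
```

For 25 million URLs this works out to roughly 100 hours of crawling and 30 GB of index space.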
This application is freely available under a GPL 3 license.
The WhelanLabs Search Engine is basically a mash-up of technologies with an administrative user interface added. The technology stack for the application is:
Figure 1: Technology Stack for WhelanLabs Search Engine
There are a few areas within the application that merit mention. They are:
Use of templates: To support changes to the configuration of the underlying components, the application uses configuration file templates in several places, which allow it to overwrite the content of the configuration files. Most of the templates are used to configure Jetty, Apache Solr, and Apache Nutch. The use of templates means that direct modifications to the configuration files might later be overwritten by actions performed by the application.
Apache Nutch configuration settings: The default configuration settings in Nutch have been changed to provide features that were seen as needed but missing from the out-of-the-box Nutch configuration. Specifically, the range of searchable document types has been greatly extended, and the maximum amount of a file to be indexed has been increased to a size that I feel will produce fewer misses in searches without being so big as to overload the system.
Outgrowing the Application: It is entirely possible that a site might outgrow the Search Engine application due to special needs (related to configuration, management, performance, or a host of other concerns). I do not believe that there is any problem with this; it might in fact be a logical evolution as administrators become more familiar with how the system works. Outgrowing the application might come in phases, starting with ‘under the covers’ configuration, moving to direct command-line invocation of the shell scripts, and ending with complete abandonment of the administrative UI and replacement of some of the third-party components. While I accept and encourage this type of advancement, the Search Engine application will likely continue to cater primarily to the newcomer, and ongoing development efforts will be focused accordingly.
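To make the template behavior described under "Use of templates" more concrete, here is a minimal sketch of how a template-driven configuration file can discard manual edits. It is only an illustration: the template text, placeholder names, and file name are hypothetical, and the application's actual templates and rendering mechanism are not shown here.

```python
from pathlib import Path
from string import Template
import tempfile

# Hypothetical template; the real templates and their placeholders may differ.
TEMPLATE_TEXT = "server.port=${port}\nindex.dir=${index_dir}\n"


def render_config(template_text, target, settings):
    """Overwrite target with the template filled in from settings."""
    rendered = Template(template_text).substitute(settings)
    target.write_text(rendered, encoding="utf-8")
    return rendered


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "example.properties"
        render_config(TEMPLATE_TEXT, target, {"port": 8080, "index_dir": "C:/search/index"})
        # A manual edit made directly in the generated file...
        edited = target.read_text(encoding="utf-8") + "custom.flag=true\n"
        target.write_text(edited, encoding="utf-8")
        # ...is lost the next time the file is regenerated from the template.
        render_config(TEMPLATE_TEXT, target, {"port": 9090, "index_dir": "C:/search/index"})
        return target.read_text(encoding="utf-8")


print(demo())
```

After the second render, the file contains only what the template produces, so the hand-added `custom.flag` line is gone. A setting that needs to survive has to be changed where the application reads it from, not in the generated file.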
A forum has been established on Nabble in order to answer questions related to this application. The forum is located at http://whelanlabs-search-engine.2313114.n4.nabble.com.
As for the underlying components, their links are as follows: