
Search Engine

Overview:

The WhelanLabs Search Engine is a graphical front end to a preconfigured installation of Apache Nutch. Its main purpose is to provide a simple, Windows-based search engine for organizations whose web-based resources are inaccessible to traditional search engines.

The WhelanLabs Search Engine is freely available from the following sites:

Version 3.0.1: 

Download via Brothersoft

Download via CNET

http://whelanlabs.home.comcast.net/~whelanlabs/files/SearchEngineInstall_3.0.1.exe

Cost:

Free!

Management GUI:


Install and Run in under 10 Minutes
[Note: For those in China, the same video can be seen at http://www.56.com/u81/v_NzI0ODYxMTA.html ]

Administrative Features:

  • Supports HTTP proxies for crawling.
  • Supports site login for crawling.
  • Supports both simple and advanced filtering criteria for crawling.
  • Supports generation of Crawler Reports.
  • Supports Windows Scheduled Events.
  • Supports Windows on non-English locales.

 Supported File Types:

| File Type | MIME Type | Extension(s) |
|---|---|---|
| Adobe Flash Files (AKA ‘Shockwave Flash’) | application/x-shockwave-flash | .swf |
| Adobe Portable Document Format | application/pdf | .pdf |
| ASCII Text Files | text/plain | .txt, .text |
| BZIP Over ZIP Compressed File Archive File | application/x-bzip2 | .boz |
| C Shell Script Files | application/x-csh | .csh |
| Compressed GZIP files | application/x-gzip | .gz |
| eXtensible Markup Language Files | application/xml | .xml |
| eXtensible Markup Language Files | text/xml | .xml |
| HTML Files | text/html | .html, .htm |
| IETF SGML document Files | text/sgml | .sgm |
| JavaScript Files | application/x-javascript | .js |
| Kspread Spreadsheet Application Files | application/x-kspread | .ksp |
| Kword Word Processor Files | application/x-kword | .kwd, .kwt |
| Microsoft Excel Files | application/vnd.ms-excel | .xls |
| Microsoft PowerPoint Files | application/vnd.ms-powerpoint | .ppt |
| Microsoft Rich Text Files | text/rtf | .rtf |
| Microsoft Word Files | application/msword | .doc |
| OASIS Open Document Master Documents | application/vnd.oasis.opendocument.text-master | .odm |
| OASIS Open Document Presentation Templates | application/vnd.oasis.opendocument.presentation-template | .otp |
| OASIS Open Document Presentations | application/vnd.oasis.opendocument.presentation | .odp |
| OASIS Open Document Spreadsheet Templates | application/vnd.oasis.opendocument.spreadsheet-template | .ots |
| OASIS Open Document Spreadsheets | application/vnd.oasis.opendocument.spreadsheet | .ods |
| OASIS Open Document Text Files | application/vnd.oasis.opendocument.text | .odt |
| OASIS Open Document Text template for HTML | application/vnd.oasis.opendocument.text-web | .oth |
| OASIS Open Document Text Templates | application/vnd.oasis.opendocument.text-template | .ott |
| OpenOffice Calc Files | application/vnd.sun.xml.calc | .sxc |
| OpenOffice Calc template Files | application/vnd.sun.xml.calc.template | .stc |
| OpenOffice Impress Files | application/vnd.sun.xml.impress | .sxi |
| OpenOffice Impress Template Files | application/vnd.sun.xml.impress.template | .sti |
| OpenOffice Writer Files | application/vnd.sun.xml.writer | .sxw |
| OpenOffice Writer Template Files | application/vnd.sun.xml.writer.template | .stw |
| PostScript Files | application/postscript | .ps |
| Really Simple Syndication Files | application/rss+xml | .rss |
| Rich Text Files | text/richtext | .rt |
| Tab Separated Values Files | text/tab-separated-values | .tsv |
| XHTML Files | application/xhtml+xml | .xhtml |
| ZIP files | application/zip | .zip |

[Note: In addition to the file types listed above, support for further file types can be added via the underlying Apache Tika framework. See Apache Tika for additional information.]
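As a rough aid for deciding whether a given resource falls within the supported file types, the extension-to-MIME-type pairs can be put into a lookup. This is only a sketch: the dictionary holds a handful of entries copied from the list above, the function name `listed_mime_type` and the example URLs are hypothetical, and the crawler itself may well identify types by content rather than by extension.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit

# A small subset of the extension / MIME type pairs listed above.
# Extend it with further rows as needed.
SUPPORTED_MIME_BY_EXTENSION = {
    ".pdf": "application/pdf",
    ".txt": "text/plain",
    ".html": "text/html",
    ".htm": "text/html",
    ".doc": "application/msword",
    ".xls": "application/vnd.ms-excel",
    ".ppt": "application/vnd.ms-powerpoint",
    ".odt": "application/vnd.oasis.opendocument.text",
    ".zip": "application/zip",
}


def listed_mime_type(url):
    """Return the listed MIME type for the URL's extension, or None.

    Hypothetical helper: it only looks at the file extension and says
    nothing about how the crawler actually detects content types.
    """
    path = urlsplit(url).path  # drops any query string or fragment
    suffix = PurePosixPath(path).suffix.lower()
    return SUPPORTED_MIME_BY_EXTENSION.get(suffix)


for candidate in (
    "http://intranet.example/docs/Report.PDF",
    "http://intranet.example/media/video.mp4",
):
    print(candidate, "->", listed_mime_type(candidate))
```

A result of `None` only means the extension is not in this abbreviated lookup; it does not prove the file cannot be indexed.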

Supported Protocols:

| Protocol | Protocol String |
|---|---|
| Regular Web Pages | HTTP:// |
| Secure HTTP Web Pages | HTTPS:// |
| File Transfer Protocol | FTP:// |
| E-mail Links | MAILTO:// |
| Local Files | FILE:// |
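When preparing a list of seed URLs, it can help to screen out entries whose scheme is not one of the protocols listed above. The following is a minimal sketch of such a check; the function name `has_supported_scheme` and the example URLs are assumptions for illustration and are not part of the application.

```python
from urllib.parse import urlsplit

# Schemes corresponding to the protocol strings listed above.
SUPPORTED_SCHEMES = {"http", "https", "ftp", "mailto", "file"}


def has_supported_scheme(url):
    """Return True if the URL's scheme is one of the listed protocols."""
    return urlsplit(url).scheme.lower() in SUPPORTED_SCHEMES


seeds = [
    "HTTP://intranet.example/",
    "file:///C:/shared/docs/",
    "gopher://old.example/",
]
usable = [seed for seed in seeds if has_supported_scheme(seed)]
print(usable)
```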

 

Sizing and Capacity:

No known maximum limits have been published for Nutch. Anecdotal evidence suggests the existence of systems with 100-200 million documents. The main question for ‘sizing’ is available resources.

  • Disk Space: About 1.2 KB of index space per document.
  • Indexing Rate: I’ve seen a 2 GHz machine on a 100 Mbit connection index roughly 250,000 pages/hour (using the same thread settings included in this configuration).

 

Current data suggests that, on average, the crawler takes about 4 hours to process each 1 million URLs, and each 1 million URLs takes about 1.2 GB of space to index. [Note: these numbers will be updated based on additional reports from the field. To report your results, please post to The WhelanLabs SearchEngine Forum.]
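To turn those averages into a quick estimate for a particular site, the arithmetic can be sketched as below. The two constants are the averages quoted above; the function name `estimate_crawl` and the 2.5 million URL example are illustrative assumptions, and real results will vary with hardware, bandwidth, and content.

```python
# Averages quoted above: 4 hours and 1.2 GB per 1 million URLs.
HOURS_PER_MILLION_URLS = 4.0
GB_PER_MILLION_URLS = 1.2


def estimate_crawl(url_count):
    """Return (hours, gigabytes) estimated by linear scaling of the averages."""
    millions = url_count / 1_000_000
    return millions * HOURS_PER_MILLION_URLS, millions * GB_PER_MILLION_URLS


hours, gigabytes = estimate_crawl(2_500_000)
print(f"{hours:.1f} hours, {gigabytes:.1f} GB")  # prints "10.0 hours, 3.0 GB"
```

The scaling is strictly linear, which is an assumption in its own right; very large crawls may not behave that way.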

License:

This application is freely available under a GPL 3 license.

Architecture and Design:

The WhelanLabs Search Engine is basically a mash-up of technologies with an administrative user interface added. The technology stack for the application is:

Figure 1: Technology Stack for WhelanLabs Search Engine

 

There are a few areas within the application that merit mention. They are:

Use of templates: In order to support modifications to the configuration of underlying components, the application makes use of configuration file templates in several places to allow the application to overwrite the content of the configuration files. Most of the use of templates involves the configuration of Jetty, Apache Solr, and Apache Nutch. The use of templates does imply that direct modifications of the configuration files might be subsequently overwritten by actions performed by the application.
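The overwrite behaviour described above can be illustrated with a small sketch. It is not the application's actual code: the `$name` placeholder syntax, the function name `write_config`, the proxy values, and the choice of `nutch-site.xml` as the generated file are all assumptions made for the example. The `http.proxy.host` and `http.proxy.port` property names are standard Nutch settings.

```python
from pathlib import Path
from string import Template
import tempfile

# Hypothetical template; the "$name" placeholder syntax is an assumption.
CONFIG_TEMPLATE = Template("""<configuration>
  <property>
    <name>http.proxy.host</name>
    <value>$proxy_host</value>
  </property>
  <property>
    <name>http.proxy.port</name>
    <value>$proxy_port</value>
  </property>
</configuration>
""")


def write_config(path, settings):
    """Regenerate the file from the template, replacing whatever was there."""
    path.write_text(CONFIG_TEMPLATE.substitute(settings), encoding="utf-8")


with tempfile.TemporaryDirectory() as folder:
    config_file = Path(folder) / "nutch-site.xml"
    settings = {"proxy_host": "proxy.example.local", "proxy_port": "8080"}
    write_config(config_file, settings)

    # Simulate a manual edit made directly in the generated file.
    with config_file.open("a", encoding="utf-8") as handle:
        handle.write("<!-- hand edit -->\n")

    # The next regeneration discards the manual edit.
    write_config(config_file, settings)
    final_text = config_file.read_text(encoding="utf-8")

print("hand edit" in final_text)  # prints False
```

In a scheme like this, a lasting change would have to be made in the template or in the settings that feed it, not in the generated file.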

Apache Nutch configuration settings: The default configuration settings in Nutch have been changed to add features that were seen as needed but missing from the out-of-the-box Nutch configuration. Specifically, the range of searchable document types has been greatly extended, and the maximum amount of a file to be indexed has been increased to a size that I feel will produce fewer misses in searches without being so big as to overload the system.

Outgrowing the Application: It is entirely possible that sites might outgrow the Search Engine application due to special needs (related to configuration, management, performance, or a host of other concerns). I do not believe that there are any problems with this, and it might in fact be a logical evolution as administrators become more familiar with how the system works. Outgrowing the application might come in phases, starting with ‘under the covers configuration’, moving to direct command-line invocation of the shell scripts, and ending with complete abandonment of the administrative UI and replacement of some of the 3rd party components. While I accept and encourage this type of advancement, the Search Engine application will likely continue to cater primarily to the newbie, with ongoing development efforts focused accordingly.

External Links:

A forum has been established on Nabble in order to answer questions related to this application. The forum is located at http://whelanlabs-search-engine.2313114.n4.nabble.com.

As for the underlying components, their links are as follows:

 

 
