URL:
Create intelligent Web spiders and
Full article
Create intelligent Web spiders
How to use Java network objects and HTML objects
Summary
Have you ever wanted to create your own database of Websites that meet specific criteria? Web spiders, sometimes referred to as Web crawlers, are programs that follow Web links from one site to another, examining content and recording locations. Commercial search sites use Web spiders to populate their databases; researchers can use spiders to find relevant information. Creating your own spider allows you to control the search for content, domains, and Webpage characteristics, such as text density and embedded multimedia content. This article shows you how to create your own powerful Web spider in Java using Java HTML and network classes. (2,300 words; November 1, 2004)
By Mark O. Pendergast
Printer-friendly version |
Mail this to a friend
his article demonstrates how to create an intelligent Web spider based on standard Java network objects. The heart of this spider is a recursive routine that can perform depth-first Web searches based on keyword/phrase criteria and Webpage characteristics. Search progress displays graphically using a
JTree
structure. I address issues such as resolving relative URLs, avoiding reference loops, and monitoring memory/stack usage. In addition, I demonstrate the proper use of Java network objects used in accessing and parsing remote Webpages.
Spider demonstration program
The demonstration program consists of the user interface class
SpiderControl
; the Web-searching class
Spider
; the two classes used to build a
JTree
showing the results,
UrlTreeNode
and
UrlNodeRenderer
; and two classes to help verify integer input into the user interface,
IntegerVerifier
and
VerifierListener
. See
Resources for a link to the full source code and documentation.
The
SpiderControl
interface is composed of three tabs, one to set the search parameters, another to display the resulting search tree (
JTree
), and a third to display error and status messages—see Figure 1.
Figure 1. Search parameters tab. Click on thumbnail to view full-sized image.
|
Search parameters include the maximum number of sites to visit, the search's maximum depth (links to links to links), a list of keywords/phrases, the root-level domains to search, and the starting Website or portal. Once the user has entered the search parameters and pressed the Start button, the Web search will start, and the second tab (Figure 2) displays to show the search's progress.
Figure 2. Search tree. Click on thumbnail to view full-sized image.
|
An instance of the
Spider
class running in a separate thread conducts the Web search. Separate threads are used so that the
SpiderControl
module can continually update the search tree's display and process the Stop Search button. As the
Spider
runs, it continually adds nodes (
UrlTreeNode
) to the
JTree
displayed in the second tab. Search tree nodes that contain keywords and phrases appear in blue (
UrlNodeRenderer
).
When the search completes, the user can view the site's vital statistics and view the site itself in an external Web browser (the program defaults to Internet Explorer, located in the Program Files folder). The vital statistics include the keywords present, total text characters, total images, and total links.
The Spider class
The
Spider
class is responsible for searching the Web given a starting point (portal), a list of keywords and domains, and limits on the search's depth and size.
Spider
inherits
Thread
so it can run in a separate thread. This allows the
SpiderControl
module to continually update the search tree's display and process the Stop Search button.
The constructor method is passed the search parameters along with a reference to an empty
JTree
and an empty
JTextArea
. The
JTree
is used to create a hierarchical record of the sites visited as the search progress. This provides visual feedback to the user and helps the
Spider
track where it has been to prevent circular searches. The
JTextArea
posts error and progress messages.
The constructor stores its parameters in class variables and initializes the
JTree
to render nodes using the
UrlNodeRenderer
class. The search will not start until
SpiderControl
calls the
run()
method.
The
run()
method starts execution in a separate thread. It first determines whether the portal site is a Web reference (starting with http, ftp, or www) or a local file reference. It then ensures the portal site has the proper notation, resets the run statistics, and calls
searchWeb()
to begin the search:
public void run()
{
DefaultTreeModel treeModel = (DefaultTreeModel)searchTree.getModel(); // get our model
DefaultMutableTreeNode root = (DefaultMutableTreeNode)treeModel.getRoot();
String urllc = startSite.toLowerCase();
if(!urllc.startsWith("http://") && !urllc.startsWith("ftp://") &&
!urllc.startsWith("www."))
{
startSite = "file:///"+startSite; // Note you must have 3 slashes !
}
else // Http missing ?
if(urllc.startsWith("www."))
{
startSite = "http://"+startSite; // Tack on http://
}
startSite = startSite.replace('\\', '/'); // Fix bad slashes
sitesFound = 0;
sitesSearched = 0;
updateStats();
searchWeb(root,startSite); // Search the Web
messageArea.append("Done!\n\n");
}
searchWeb()
is a recursive method that accepts as parameters a parent node in the search tree and a Web address to search.
searchWeb()
first verifies that the given Website has not already been visited and that depth and site limits have not been exceeded.
searchWeb()
then yields to allow the
SpiderControl
thread to run (updating the screen and checking for Stop Search button presses). If all is in order,
searchWeb()
continues, if not, it returns.
Before
searchWeb()
begins reading and parsing the Website, it first verifies that the site is of the proper type and domain by creating a
URL
object based on the Website. The
URL
's protocol is checked to ensure it is either an HTML address or a file address (no need to search for "mailto:" and other protocols). Then the file extension (if present) is checked to ensure that it is an HTML file (no need to parse pdf or gif files). Once that is done, the domain is checked against the list specified by the user with the
isDomainOk()
method:
...URL url = new URL(urlstr); // Create the URL object from a string.
String protocol = url.getProtocol(); // Ask the URL for its protocol
if(!protocol.equalsIgnoreCase("http") && !protocol.equalsIgnoreCase("file"))
{
messageArea.append(" Skipping : "+urlstr+" not a http site\n\n");
return;
}
String path = url.getPath(); // Ask the URL for its path
int lastdot = path.lastIndexOf("."); // Check for file extension
if(lastdot > 0)
{
String extension = path.substring(lastdot); // Just the file extension
if(!extension.equalsIgnoreCase(".html") && !extension.equalsIgnoreCase(".htm"))
return; // Skip everything but html files
}
if(!isDomainOk(url))
{
messageArea.append(" Skipping : "+urlstr+" not in domain list\n\n");
return;
}
At this point,
searchWeb()
is fairly certain it has a URL worth searching, so it creates a new node for the search tree, adds it to the tree, opens an input stream, and parses the file. The following sections provide more details on parsing HTML files, resolving relative URLs, and controlling recursion.
Parsing HTML files
There are two ways to parse (pick apart) an HTML file to look for the
A HREF =
tags—a hard way and an easy way.
If you choose the hard way, you would create your own parsing algorithm using Java's
StreamTokenizer
class. With this technique, you'd have to specify the word and white-space characters for the
StreamTokenizer
object, then pick off the
<
and
>
symbols to find the tags, the attributes, and separate the text between tags. A lot of work.
The easy way is to use the built-in
ParserDelegator
class, a subclass of the
HTMLEditorKit.Parser
abstract class. These classes are not well documented in the Java documentation. Using
ParserDelegator
is a three-step process. First, create an
InputStreamReader
object from your URL; then, create an instance of a
ParserCallback
object; finally, create an instance of the
ParserDelegator
object and call its one public method
parse()
:
UrlTreeNode newnode = new UrlTreeNode(url); // Create the data node
InputStream in = url.openStream(); // Ask the URL object to create an input stream
InputStreamReader isr = new InputStreamReader(in); // Convert the stream to a reader
DefaultMutableTreeNode treenode = addNode(parentnode, newnode);
SpiderParserCallback cb = new SpiderParserCallback(treenode); // Create a callback object
ParserDelegator pd = new ParserDelegator(); // Create the delegator
pd.parse(isr,cb,true); // Parse the stream
isr.close(); // Close the stream
parse()
is passed an
InputStreamReader
, an instance of a
ParseCallback
object, and a flag for specifying whether the
CharSet
tags should be ignored.
parse()
then reads and decodes the HTML file, calling methods in the
ParserCallback
object each time it has completely decoded a tag or HTML element.
In the demonstration code, I implemented my
ParserCallback
as an inner class of
Spider
. Doing so allows
ParseCallback
to access
Spider
's methods and variables. Classes based on
ParserCallback
can override the following methods:
-
handleStartTag()
: Called when a starting HTML tag is encountered, e.g., >A <
-
handleEndTag()
: Called when an end HTML tag is encountered, e.g., >/A<
-
handleSimpleTag()
: Called when a HTML tag that has no matching end tag is encountered -
handleText()
: Called when text between tags is encountered
In the demonstration program, I overrode the
handleSimpleTag()
,
handleStartTag()
,
handleEndTag()
, and
handleTextTag()
methods.
I overrode
handleSimpleTag()
so that my code can process HTML
BASE
and
IMG
tags.
BASE
tags tell what URL to use when resolving relative URL references. If no
BASE
tag is present, then the current URL is used to resolve relative references.
handleSimpleTag()
is passed three parameters, an
HTML.Tag
object, a
MutableAttributeSet
that contains all the tag's attributes, and relative position within the file. My code checks the tag to see if it is a
BASE
object instance; if it is, then the HREF attribute is retrieved and stored in the page's data node. This attribute is used later when resolving URL addresses to linked Websites. Each time an
IMG
tag is encountered, that page's image count is updated.
I overrode
handleStartTag()
so that the program can process HTML
A
and
TITLE
tags. The method tests to see if the
t
parameter is in fact an
A
tag, if it is, then the HREF attribute is retrieved.
fixHref()
is called to clean up sloppy references (changes back slashes to forward slashes, adds missing final slashes). The link's URL is resolved by creating a
URL
object instance using the base URL and the one referenced. Then, a recursive call to
searchWeb()
processes this link. If the method encounters a
TITLE
tag, it clears the variable storing the last text encountered so that the title's end tag is assured of having the proper value (sometimes, a Webpage will have title tags with no title between them).
I overrode
handleEndTag()
so the HTML
TITLE
end tags can be processed. This end tag indicates that the previous text (stored in
lastText
) is the page's title text. This text is then stored in the page's data node. Since adding the title information to the data node will change the display of the data node in the tree, the
nodeChanged()
method must be called so the tree can adjust its layout.
I overrode
handleText()
so that the HTML page's text can be checked for any of the keywords or phrases being searched.
handleText()
is passed an array of characters and the position of the characters within the file.
handleText()
first converts the character array to a
String
object, converting to all uppercase in the process. Then each keyword/phrase in the search list is checked against the
String
object using the
indexOf()
method. If
indexOf()
returns a non-negative result, then the keyword/phrase is present in the page's text. If the keyword/phrase is present, the match is recorded in the node's match list and run statistics are updated:
public class SpiderParserCallback extends HTMLEditorKit.ParserCallback {
/**
* Inner class used to html handle parser callbacks
*/
public class SpiderParserCallback extends HTMLEditorKit.ParserCallback {
/** URL node being parsed */
private UrlTreeNode node;
/** Tree node */
private DefaultMutableTreeNode treenode;
/** Contents of last text element */
private String lastText = "";
/**
* Creates a new instance of SpiderParserCallback
* @param atreenode search tree node that is being parsed
*/
public SpiderParserCallback(DefaultMutableTreeNode atreenode) {
treenode = atreenode;
node = (UrlTreeNode)treenode.getUserObject();
}
/**
* Handle HTML tags that don't have a start and end tag
* @param t HTML tag
* @param a HTML attributes
* @param pos Position within file
*/
public void handleSimpleTag(HTML.Tag t,
MutableAttributeSet a,
int pos)
{
if(t.equals(HTML.Tag.IMG))
{
node.addImages(1);
return;
}
if(t.equals(HTML.Tag.BASE))
{
Object value = a.getAttribute(HTML.Attribute.HREF);
if(value != null)
node.setBase(fixHref(value.toString()));
}
}
/**
* Take care of start tags
* @param t HTML tag
* @param a HTML attributes
* @param pos Position within file
*/
public void handleStartTag(HTML.Tag t,
MutableAttributeSet a,
int pos)
{
if(t.equals(HTML.Tag.TITLE))
{
lastText="";
return;
}
if(t.equals(HTML.Tag.A))
{
Object value = a.getAttribute(HTML.Attribute.HREF);
if(value != null)
{
node.addLinks(1);
String href = value.toString();
href = fixHref(href);
try{
URL referencedURL = new URL(node.getBase(),href);
searchWeb(treenode, referencedURL.getProtocol()+"://"+referencedURL.getHost()+referencedURL.getPath());
}
catch (MalformedURLException e)
{
messageArea.append(" Bad URL encountered : "+href+"\n\n");
return;
}
}
}
}
/**
* Take care of start tags
* @param t HTML tag
* @param pos Position within file
*/
public void handleEndTag(HTML.Tag t,
int pos)
{
if(t.equals(HTML.Tag.TITLE) && lastText != null)
{
node.setTitle(lastText.trim());
DefaultTreeModel tm = (DefaultTreeModel)searchTree.getModel();
tm.nodeChanged(treenode);
}
}
/**
* Take care of text between tags, check against keyword list for matches, if
* match found, set the node match status to true
* @param data Text between tags
* @param pos position of text within Webpage
*/
public void handleText(char[] data, int pos)
{
lastText = new String(data);
node.addChars(lastText.length());
String text = lastText.toUpperCase();
for(int i = 0; i < keywordList.length; i++)
{
if(text.indexOf(keywordList[i]) >= 0)
{
if(!node.isMatch())
{
sitesFound++;
updateStats();
}
node.setMatch(keywordList[i]);
return;
}
}
}
}
Resolving and repairing URLS
When relative links to Webpages are encountered, you must build complete links based on their base URLs. Base URLs can be explicitly defined in a Webpage via the
BASE
tag or implicitly defined as the URL of the page holding the link. The Java
URL
object provides a constructor that handles the resolution for you, providing you give it links structured to its liking.
URL(URL context, String spec)
accepts the link in the
spec
parameter and the base URL in the
context
parameter. If
spec
is a relative link, the constructor will create a
URL
object using
context
to build the complete reference.
URL
prefers all URL specifications to be in the strict (Unix) format. Use of backslashes—which is allowed in the Microsoft Windows world—instead of forward slashes, result in bad references. Also, if
spec
or
context
refers to a directory (containing
index.html
or
default.html
), instead of an HTML file, it must have a final slash. The method
fixHref()
checks for these sloppy references and fixes them:
public static String fixHref(String href)
{
String newhref = href.replace('\\', '/'); // Fix sloppy Web references
int lastdot = newhref.lastIndexOf('.');
int lastslash = newhref.lastIndexOf('/');
if(lastslash > lastdot)
{
if(newhref.charAt(newhref.length()-1) != '/')
newhref = newhref+"/"; // Add missing /
}
return newhref;
}
Controlling recursion
searchWeb()
is initially called to search the starting Web address specified by the user. It then calls itself whenever an HTML link is found. This forms the basis of the "depth-first" search and can lead to two types of problems. First, and most critical, memory/stack overflow problems can result due to too many recursive calls. These will occur if there is a circular Web reference; that is, one Webpage links to another that links back to the first—a common occurrence in the World Wide Web. To prevent this,
searchWeb()
checks the search tree (via the
urlHasBeenVisited()
method) to see if the referenced page already exists. If it does, then the link is ignored. If you choose to implement a spider without a search tree, you still must maintain a list of sites visited (either in a
Vector
or array) so that you can determine if you are revisiting a site.
The second problem with recursion results from the nature of a depth-first search and the World Wide Web's structure. Depending on the chosen portal, a depth-first search could result in hundreds of recursive calls before the original link on the original page is completely processed. This has two undesirable consequences: first, memory/stack overflow could occur, and second, the pages being searched may be too far removed from the original portal to give meaningful results. To control this, I added a "maximum search depth" setting to the spider. The user may select how deep the number of levels can go (links to links to links); as each link is entered, the current depth is checked via a call to the
depthLimitExceeded()
method. If the limit is exceeded, the link is ignored. This test merely checks the level of the node in the
JTree
.
The demonstration program also adds a site limit, specified by the user, that can stop the search after the specified number of URLs has been examined, thus ensuring the program will eventually stop! The site limit is controlled via a simple integer counter
sitesSearched
that is updated and checked after each call to
searchWeb()
.
UrlTreeNode and UrlNodeRenderer
UrlTreeNode
and
UrlNodeRenderer
are classes for creating custom tree nodes to add to the
JTree
in the
SpiderControl
user interface.
UrlTreeNode
contains the statistics and URL information for each searched Website. The
UrlTreeNode
is stored in the
JTree
as the "user object" attribute of the standard
DefaultMutableTreeNode
objects. The data includes the ability to track keywords present in the node, the node's URL, the node's base URL, the number of links, images, and text characters, and whether the node matches the search criteria.
UrlTreeNodeRenderer
is an implementation of the
DefaultTreeCellRenderer
interface.
UrlTreeNodeRenderer
causes the node to display in blue text if the node contains matching keywords.
UrlTreeNodeRenderer
also incorporates a custom icon for the
JTreeNode
s. Custom display is achieved by overriding the
getTreeCellRendererComponent()
method (see below). This method creates a
Component
object to display in the tree. Most of
Component
's attributes are set by the superclass;
UrlTreeNodeRenderer
changes the text color (foreground) and icons:
public Component getTreeCellRendererComponent(
JTree tree,
Object value,
boolean sel,
boolean expanded,
boolean leaf,
int row,
boolean hasFocus) {
super.getTreeCellRendererComponent(
tree, value, sel,
expanded, leaf, row,
hasFocus);
UrlTreeNode node = (UrlTreeNode)(((DefaultMutableTreeNode)value).getUserObject());
if (node.isMatch()) // Set color
setForeground(Color.blue);
else
setForeground(Color.black);
if(icon != null) // Set a custom icon
{
setOpenIcon(icon);
setClosedIcon(icon);
setLeafIcon(icon);
}
return this;
}
Summary
This article has shown you how to create a Web spider and a user interface to control it. The user interface employs a
JTree
to track the spider's progress and record the sites visited. Alternatively, you could use a
Vector
to record the sites visited and display the progress using a simple counter. Other enhancements could include an interface to a database to record keywords and sites, adding the ability to search through multiple portals, screening sites with too much or too little text content, and giving the search engine synonym-search capabilities.
The
Spider
class shown in this article uses recursive calls to a search procedure. Alternatively, a separate thread with a new spider could be launched for each link encountered. This has the benefit of allowing connections to remote URLs to occur concurrently, speeding execution. However, note that some
JTree
objects, namely
DefaultMutableTreeNode
, are not thread-safe, and programmers must perform their own synchronization.
Printer-friendly version |
Mail this to a friend
About the author
Dr. Mark Pendergast is an associate professor at Florida Gulf Coast University. Pendergast received an M.S. and Ph.D. from the University of Arizona and a B.S.E. in electrical computer engineering from the University of Michigan. He has worked as an engineer for Control Data Corporation, Harris Corporation, Ventana Corporation, and has taught at the University of Florida and Washington State University. His works have appeared in books, journals, and he has presented his work at numerous conferences. He is a member of the ACM and IEEE computer societies.