Create intelligent Web spiders


How to use Java network objects and HTML objects





Summary


Have you ever wanted to create your own database of Websites that meet specific criteria? Web spiders, sometimes referred to as Web crawlers, are programs that follow Web links from one site to another, examining content and recording locations. Commercial search sites use Web spiders to populate their databases; researchers can use spiders to find relevant information. Creating your own spider allows you to control the search for content, domains, and Webpage characteristics, such as text density and embedded multimedia content. This article shows you how to create your own powerful Web spider in Java using Java HTML and network classes. (2,300 words; November 1, 2004)


By Mark O. Pendergast













This article demonstrates how to create an intelligent Web spider based on standard Java network objects. The heart of this spider is a recursive routine that can perform depth-first Web searches based on keyword/phrase criteria and Webpage characteristics. Search progress displays graphically using a JTree structure. I address issues such as resolving relative URLs, avoiding reference loops, and monitoring memory/stack usage. In addition, I demonstrate the proper use of the Java network objects involved in accessing and parsing remote Webpages.


Spider demonstration program

The demonstration program consists of the user interface class SpiderControl; the Web-searching class Spider; the two classes used to build a JTree showing the results, UrlTreeNode and UrlNodeRenderer; and two classes to help verify integer input into the user interface, IntegerVerifier and VerifierListener. See Resources for a link to the full source code and documentation.




The SpiderControl interface is composed of three tabs, one to set the search parameters, another to display the resulting search tree (JTree), and a third to display error and status messages—see Figure 1.

















Figure 1. Search parameters tab.







Search parameters include the maximum number of sites to visit, the search's maximum depth (links to links to links), a list of keywords/phrases, the root-level domains to search, and the starting Website or portal. Once the user has entered the search parameters and pressed the Start button, the Web search starts, and the second tab (Figure 2) appears to show the search's progress.

















Figure 2. Search tree.






An instance of the Spider class running in a separate thread conducts the Web search. Separate threads are used so that the SpiderControl module can continually update the search tree's display and process the Stop Search button. As the Spider runs, it continually adds nodes (UrlTreeNode) to the JTree displayed in the second tab. Search tree nodes that contain keywords and phrases appear in blue (UrlNodeRenderer).




When the search completes, the user can view each site's vital statistics and open the site itself in an external Web browser (the program defaults to Internet Explorer, located in the Program Files folder). The vital statistics include the keywords present, total text characters, total images, and total links.
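The article does not list the browser-launching code; a minimal sketch, assuming the Internet Explorer executable path and the urlToShow variable (both purely illustrative), might be:

// Sketch only: launch the selected page in an external browser.
// Requires java.io.IOException; the path below is an assumed install location.
String urlToShow = "http://www.example.com/";  // URL taken from the selected tree node
String iePath = "C:\\Program Files\\Internet Explorer\\IEXPLORE.EXE";
try {
    Runtime.getRuntime().exec(new String[] { iePath, urlToShow });
}
catch(IOException e) {
    messageArea.append("Unable to launch browser: " + e + "\n");
}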


The Spider class

The Spider class is responsible for searching the Web given a starting point (portal), a list of keywords and domains, and limits on the search's depth and size. Spider inherits Thread so it can run in a separate thread. This allows the SpiderControl module to continually update the search tree's display and process the Stop Search button.




The constructor method is passed the search parameters along with a reference to an empty JTree and an empty JTextArea. The JTree is used to create a hierarchical record of the sites visited as the search progresses. This provides visual feedback to the user and helps the Spider track where it has been to prevent circular searches. The JTextArea posts error and progress messages.




The constructor stores its parameters in class variables and initializes the JTree to render nodes using the UrlNodeRenderer class. The search does not start until SpiderControl starts the Spider thread, which invokes the run() method.
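The constructor itself is not listed in this article. A minimal sketch, assuming field and parameter names that are not all confirmed by the listings (maxDepth, maxSites, domainList), might look like this:

// Sketch only: store the search parameters and hook up the custom renderer.
public Spider(String startSite, String[] keywordList, String[] domainList,
              int maxDepth, int maxSites, JTree searchTree, JTextArea messageArea)
{
    this.startSite = startSite;
    this.keywordList = keywordList;
    this.domainList = domainList;       // assumed field name
    this.maxDepth = maxDepth;           // assumed field name
    this.maxSites = maxSites;           // assumed field name
    this.searchTree = searchTree;
    this.messageArea = messageArea;
    searchTree.setCellRenderer(new UrlNodeRenderer()); // render matching nodes in blue
}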



The run() method executes in the Spider's own thread. It first determines whether the portal site is a Web reference (starting with http, ftp, or www) or a local file reference. It then ensures the portal site has the proper notation, resets the run statistics, and calls searchWeb() to begin the search:







public void run()
{
    DefaultTreeModel treeModel = (DefaultTreeModel)searchTree.getModel(); // get our model
    DefaultMutableTreeNode root = (DefaultMutableTreeNode)treeModel.getRoot();
    String urllc = startSite.toLowerCase();
    if(!urllc.startsWith("http://") && !urllc.startsWith("ftp://") &&
       !urllc.startsWith("www."))
    {
        startSite = "file:///"+startSite; // Note you must have 3 slashes !
    }
    else // Http missing ?
        if(urllc.startsWith("www."))
        {
            startSite = "http://"+startSite; // Tack on http://
        }

    startSite = startSite.replace('\\', '/'); // Fix bad slashes

    sitesFound = 0;
    sitesSearched = 0;
    updateStats();
    searchWeb(root,startSite); // Search the Web
    messageArea.append("Done!\n\n");
}








searchWeb() is a recursive method that accepts as parameters a parent node in the search tree and a Web address to search. searchWeb() first verifies that the given Website has not already been visited and that the depth and site limits have not been exceeded. searchWeb() then yields to allow the SpiderControl thread to run (updating the screen and checking for Stop Search button presses). If all is in order, searchWeb() continues; if not, it returns.
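The full searchWeb() listing is in the source download; a sketch of its overall shape, using the helper names mentioned in this article (the exact signatures and the maxSites field name are assumptions), looks roughly like this:

// Sketch of the method's overall structure; the parsing section is shown
// in the excerpts that follow.
private void searchWeb(DefaultMutableTreeNode parentnode, String urlstr)
{
    if(urlHasBeenVisited(urlstr))       // already in the search tree? skip it
        return;
    if(depthLimitExceeded(parentnode))  // too far from the portal? skip it
        return;
    if(sitesSearched >= maxSites)       // site limit reached? stop expanding
        return;
    Thread.yield();                     // let SpiderControl repaint and check the Stop button

    // ... create the URL object, check its protocol, extension, and domain,
    // ... then open a stream and parse the page (see the listings below)
}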




Before searchWeb() begins reading and parsing the Website, it first verifies that the site is of the proper type and domain by creating a URL object based on the Website. The URL's protocol is checked to ensure it is either an HTTP address or a file address (no need to search "mailto:" and other protocols). Then the file extension (if present) is checked to ensure that it is an HTML file (no need to parse PDF or GIF files). Once that is done, the domain is checked against the list specified by the user with the isDomainOk() method:






...URL url = new URL(urlstr); // Create the URL object from a string.

String protocol = url.getProtocol(); // Ask the URL for its protocol
if(!protocol.equalsIgnoreCase("http") && !protocol.equalsIgnoreCase("file"))
{
    messageArea.append(" Skipping : "+urlstr+" not a http site\n\n");
    return;
}

String path = url.getPath(); // Ask the URL for its path
int lastdot = path.lastIndexOf("."); // Check for file extension
if(lastdot > 0)
{
    String extension = path.substring(lastdot); // Just the file extension
    if(!extension.equalsIgnoreCase(".html") && !extension.equalsIgnoreCase(".htm"))
        return; // Skip everything but html files
}

if(!isDomainOk(url))
{
    messageArea.append(" Skipping : "+urlstr+" not in domain list\n\n");
    return;
}
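The isDomainOk() implementation is not reproduced in this excerpt. A minimal sketch, assuming the user's root-level domains are stored in a String array field named domainList (a name not shown in the listings), might look like this:

// Sketch only: domainList is an assumed field holding entries such as ".edu" or ".gov".
private boolean isDomainOk(URL url)
{
    if(url.getProtocol().equalsIgnoreCase("file"))
        return true;                       // local files are always acceptable
    String host = url.getHost().toLowerCase();
    for(int i = 0; i < domainList.length; i++)
    {
        if(host.endsWith(domainList[i].toLowerCase()))
            return true;                   // host falls within an allowed domain
    }
    return false;
}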









At this point, searchWeb() is fairly certain it has a URL worth searching, so it creates a new node for the search tree, adds it to the tree, opens an input stream, and parses the file. The following sections provide more details on parsing HTML files, resolving relative URLs, and controlling recursion.


Parsing HTML files

There are two ways to parse (pick apart) an HTML file to look for the <A HREF=...> tags: a hard way and an easy way.



If you choose the hard way, you would create your own parsing algorithm using Java's StreamTokenizer class. With this technique, you'd have to specify the word and white-space characters for the StreamTokenizer object, then pick off the < and > symbols to find the tags and their attributes, and separate out the text between tags. A lot of work.
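To give a feel for how much work the hard way involves, here is a rough sketch (not part of the demonstration program) of a StreamTokenizer-based scan for tags; the method name and structure are purely illustrative:

// Rough sketch only; requires java.io.* and java.net.URL.
void scanTagsTheHardWay(URL url) throws IOException
{
    StreamTokenizer st = new StreamTokenizer(new InputStreamReader(url.openStream()));
    st.ordinaryChar('<');
    st.ordinaryChar('>');
    st.ordinaryChar('/');   // disable default comment handling so "</A>" is not swallowed
    boolean insideTag = false;
    int token;
    while((token = st.nextToken()) != StreamTokenizer.TT_EOF)
    {
        if(token == '<')        insideTag = true;    // a tag is opening
        else if(token == '>')   insideTag = false;   // the tag has closed
        else if(insideTag && token == StreamTokenizer.TT_WORD)
        {
            // st.sval now holds a word inside the tag (e.g., "A" or "HREF");
            // you would inspect it here to find links and attributes
        }
    }
}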




The easy way is to use the built-in ParserDelegator class, a subclass of the HTMLEditorKit.Parser abstract class. These classes are not well documented in the Java documentation. Using ParserDelegator is a three-step process. First, create an InputStreamReader object from your URL; then, create an instance of a ParserCallback object; finally, create an instance of the ParserDelegator object and call its one public method parse():







UrlTreeNode newnode = new UrlTreeNode(url); // Create the data node
InputStream in = url.openStream(); // Ask the URL object to create an input stream
InputStreamReader isr = new InputStreamReader(in); // Convert the stream to a reader
DefaultMutableTreeNode treenode = addNode(parentnode, newnode);
SpiderParserCallback cb = new SpiderParserCallback(treenode); // Create a callback object
ParserDelegator pd = new ParserDelegator(); // Create the delegator
pd.parse(isr,cb,true); // Parse the stream
isr.close(); // Close the stream








parse() is passed an InputStreamReader, an instance of a ParserCallback object, and a flag specifying whether the charset declaration should be ignored. parse() then reads and decodes the HTML file, calling methods in the ParserCallback object each time it has completely decoded a tag or HTML element.




In the demonstration code, I implemented my ParserCallback as an inner class of Spider. Doing so allows the ParserCallback to access Spider's methods and variables. Classes based on ParserCallback can override the following methods:



  • handleStartTag(): Called when a starting HTML tag is encountered, e.g., <A>
  • handleEndTag(): Called when an ending HTML tag is encountered, e.g., </A>
  • handleSimpleTag(): Called when an HTML tag that has no matching end tag is encountered
  • handleText(): Called when text between tags is encountered




In the demonstration program, I overrode the handleSimpleTag(), handleStartTag(), handleEndTag(), and handleText() methods.



I overrode handleSimpleTag() so that my code can process HTML BASE and IMG tags. A BASE tag tells what URL to use when resolving relative URL references. If no BASE tag is present, then the current URL is used to resolve relative references. handleSimpleTag() is passed three parameters: an HTML.Tag object, a MutableAttributeSet that contains all the tag's attributes, and the tag's relative position within the file. My code checks the tag to see if it is a BASE object instance; if it is, the HREF attribute is retrieved and stored in the page's data node. This attribute is used later when resolving URL addresses to linked Websites. Each time an IMG tag is encountered, that page's image count is updated.




I overrode handleStartTag() so that the program can process HTML A and TITLE tags. The method tests to see if the t parameter is in fact an A tag; if it is, the HREF attribute is retrieved.




fixHref() is called to clean up sloppy references (it changes backslashes to forward slashes and adds missing final slashes). The link's URL is resolved by creating a URL object instance using the base URL and the one referenced. Then, a recursive call to searchWeb() processes this link. If the method encounters a TITLE tag, it clears the variable storing the last text encountered so that the title's end tag is assured of having the proper value (sometimes, a Webpage will have title tags with no title between them).



I overrode handleEndTag() so the HTML TITLE end tag can be processed. This end tag indicates that the preceding text (stored in lastText) is the page's title. This text is then stored in the page's data node. Since adding the title information to the data node changes how the node displays in the tree, the nodeChanged() method must be called so the tree can adjust its layout.




I overrode handleText() so that the HTML page's text can be checked for any of the keywords or phrases being searched. handleText() is passed an array of characters and the position of the characters within the file. handleText() first converts the character array to a String object, converting to all uppercase in the process. Then each keyword/phrase in the search list is checked against the String object using the indexOf() method. If indexOf() returns a non-negative result, then the keyword/phrase is present in the page's text. If the keyword/phrase is present, the match is recorded in the node's match list and run statistics are updated:







/**
 * Inner class used to handle HTML parser callbacks
 */
public class SpiderParserCallback extends HTMLEditorKit.ParserCallback {
    /** URL node being parsed */
    private UrlTreeNode node;
    /** Tree node */
    private DefaultMutableTreeNode treenode;
    /** Contents of last text element */
    private String lastText = "";

    /**
     * Creates a new instance of SpiderParserCallback
     * @param atreenode search tree node that is being parsed
     */
    public SpiderParserCallback(DefaultMutableTreeNode atreenode) {
        treenode = atreenode;
        node = (UrlTreeNode)treenode.getUserObject();
    }

    /**
     * Handle HTML tags that don't have a start and end tag
     * @param t HTML tag
     * @param a HTML attributes
     * @param pos Position within file
     */
    public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos)
    {
        if(t.equals(HTML.Tag.IMG))
        {
            node.addImages(1);
            return;
        }
        if(t.equals(HTML.Tag.BASE))
        {
            Object value = a.getAttribute(HTML.Attribute.HREF);
            if(value != null)
                node.setBase(fixHref(value.toString()));
        }
    }

    /**
     * Take care of start tags
     * @param t HTML tag
     * @param a HTML attributes
     * @param pos Position within file
     */
    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos)
    {
        if(t.equals(HTML.Tag.TITLE))
        {
            lastText="";
            return;
        }
        if(t.equals(HTML.Tag.A))
        {
            Object value = a.getAttribute(HTML.Attribute.HREF);
            if(value != null)
            {
                node.addLinks(1);
                String href = value.toString();
                href = fixHref(href);
                try{
                    URL referencedURL = new URL(node.getBase(),href);
                    searchWeb(treenode, referencedURL.getProtocol()+"://"+referencedURL.getHost()+referencedURL.getPath());
                }
                catch (MalformedURLException e)
                {
                    messageArea.append(" Bad URL encountered : "+href+"\n\n");
                    return;
                }
            }
        }
    }

    /**
     * Take care of end tags
     * @param t HTML tag
     * @param pos Position within file
     */
    public void handleEndTag(HTML.Tag t, int pos)
    {
        if(t.equals(HTML.Tag.TITLE) && lastText != null)
        {
            node.setTitle(lastText.trim());
            DefaultTreeModel tm = (DefaultTreeModel)searchTree.getModel();
            tm.nodeChanged(treenode);
        }
    }

    /**
     * Take care of text between tags, check against keyword list for matches, if
     * match found, set the node match status to true
     * @param data Text between tags
     * @param pos position of text within Webpage
     */
    public void handleText(char[] data, int pos)
    {
        lastText = new String(data);
        node.addChars(lastText.length());
        String text = lastText.toUpperCase();
        for(int i = 0; i < keywordList.length; i++)
        {
            if(text.indexOf(keywordList[i]) >= 0)
            {
                if(!node.isMatch())
                {
                    sitesFound++;
                    updateStats();
                }
                node.setMatch(keywordList[i]);
                return;
            }
        }
    }
}







Resolving and repairing URLs

When relative links to Webpages are encountered, you must build complete links based on their base URLs. Base URLs can be explicitly defined in a Webpage via the BASE tag or implicitly defined as the URL of the page holding the link. The Java URL object provides a constructor that handles the resolution for you, provided you give it links structured to its liking.



URL(URL context, String spec) accepts the link in the spec parameter and the base URL in the context parameter. If spec is a relative link, the constructor creates a URL object using context to build the complete reference. URL prefers all URL specifications to be in the strict (Unix) format. Using backslashes, as allowed in the Microsoft Windows world, instead of forward slashes results in bad references. Also, if spec or context refers to a directory (one containing index.html or default.html) rather than to an HTML file, it must have a final slash. The fixHref() method checks for these sloppy references and fixes them:







public static String fixHref(String href)
{
    String newhref = href.replace('\\', '/'); // Fix sloppy Web references
    int lastdot = newhref.lastIndexOf('.');
    int lastslash = newhref.lastIndexOf('/');
    if(lastslash > lastdot)
    {
        if(newhref.charAt(newhref.length()-1) != '/')
            newhref = newhref+"/"; // Add missing /
    }

    return newhref;
}
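As a quick illustration (not taken from the demonstration program, and using a made-up address), here is how fixHref() and the two-argument URL constructor work together; the MalformedURLException must be caught or declared:

// Illustration only: resolve a sloppy relative reference against a base URL.
URL base = new URL("http://www.example.com/docs/");   // base ends with a slash
String href = fixHref("reports\\2004\\index.html");   // backslashes are repaired
URL resolved = new URL(base, href);
// resolved is now http://www.example.com/docs/reports/2004/index.html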







Controlling recursion

searchWeb() is initially called to search the starting Web address specified by the user. It then calls itself whenever an HTML link is found. This forms the basis of the "depth-first" search and can lead to two types of problems. First, and most critical, memory/stack overflow problems can result from too many recursive calls. These occur if there is a circular Web reference, that is, one Webpage linking to another that links back to the first, a common occurrence on the World Wide Web. To prevent this, searchWeb() checks the search tree (via the urlHasBeenVisited() method) to see if the referenced page already exists. If it does, the link is ignored. If you choose to implement a spider without a search tree, you still must maintain a list of sites visited (either in a Vector or array) so that you can determine whether you are revisiting a site.
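The urlHasBeenVisited() implementation is not shown in this article. A minimal sketch that walks the existing search tree, assuming UrlTreeNode exposes an accessor such as getUrlString() (a hypothetical name), could be:

// Sketch only: compare the candidate address against every node already in the tree.
private boolean urlHasBeenVisited(String urlstr)
{
    DefaultTreeModel treeModel = (DefaultTreeModel)searchTree.getModel();
    DefaultMutableTreeNode root = (DefaultMutableTreeNode)treeModel.getRoot();
    java.util.Enumeration nodes = root.preorderEnumeration();
    while(nodes.hasMoreElements())
    {
        DefaultMutableTreeNode n = (DefaultMutableTreeNode)nodes.nextElement();
        Object uo = n.getUserObject();
        if(uo instanceof UrlTreeNode &&
           urlstr.equalsIgnoreCase(((UrlTreeNode)uo).getUrlString())) // assumed accessor
            return true;   // already visited
    }
    return false;
}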




The second problem with recursion results from the nature of a depth-first search and the World Wide Web's structure. Depending on the chosen portal, a depth-first search could result in hundreds of recursive calls before the original link on the original page is completely processed. This has two undesirable consequences: first, memory/stack overflow could occur, and second, the pages being searched may be too far removed from the original portal to give meaningful results. To control this, I added a "maximum search depth" setting to the spider. The user may select how many levels deep the search can go (links to links to links); as each link is entered, the current depth is checked via a call to the depthLimitExceeded() method. If the limit is exceeded, the link is ignored. This test merely checks the level of the node in the JTree.
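A sketch of such a depth test, assuming the limit is stored in a field named maxDepth (an assumed name), is simply:

// Sketch only: compare the node's distance from the root against the user-selected limit.
private boolean depthLimitExceeded(DefaultMutableTreeNode node)
{
    return node.getLevel() >= maxDepth;
}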



The demonstration program also adds a site limit, specified by the user, that stops the search after the specified number of URLs has been examined, thus ensuring the program will eventually stop! The site limit is controlled via a simple integer counter, sitesSearched, that is updated and checked after each call to searchWeb().


UrlTreeNode and UrlNodeRenderer


UrlTreeNode and UrlNodeRenderer are classes for creating custom tree nodes to add to the JTree in the SpiderControl user interface. UrlTreeNode contains the statistics and URL information for each searched Website. The UrlTreeNode is stored in the JTree as the "user object" attribute of the standard DefaultMutableTreeNode objects. Each node tracks the keywords present, the node's URL, its base URL, the number of links, images, and text characters, and whether the node matches the search criteria.
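The full UrlTreeNode class ships with the source download; the condensed skeleton below is only a sketch that is consistent with how the class is used in the listings above, and its field names are assumptions:

// Skeleton only; requires java.net.URL, java.net.MalformedURLException, java.util.Vector.
public class UrlTreeNode
{
    private URL url;            // page address
    private URL base;           // base URL for resolving relative links
    private String title = "";  // page title
    private int links, images, chars;
    private java.util.Vector matches = new java.util.Vector(); // keywords found

    public UrlTreeNode(URL url)     { this.url = url; this.base = url; }
    public void setBase(String b)   { try { base = new URL(b); } catch(MalformedURLException e) {} }
    public URL  getBase()           { return base; }
    public void setTitle(String t)  { title = t; }
    public void addLinks(int n)     { links += n; }
    public void addImages(int n)    { images += n; }
    public void addChars(int n)     { chars += n; }
    public boolean isMatch()        { return !matches.isEmpty(); }
    public void setMatch(String kw) { if(!matches.contains(kw)) matches.add(kw); }
    public String toString()        { return title.length() > 0 ? title : url.toString(); }
}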




UrlNodeRenderer extends the DefaultTreeCellRenderer class. UrlNodeRenderer causes a node to display in blue text if the node contains matching keywords. UrlNodeRenderer also supplies a custom icon for the tree nodes. Custom display is achieved by overriding the getTreeCellRendererComponent() method (see below). This method creates a Component object to display in the tree. Most of the Component's attributes are set by the superclass; UrlNodeRenderer changes the text color (foreground) and the icons:







public Component getTreeCellRendererComponent(
                    JTree tree,
                    Object value,
                    boolean sel,
                    boolean expanded,
                    boolean leaf,
                    int row,
                    boolean hasFocus) {

    super.getTreeCellRendererComponent(
                    tree, value, sel,
                    expanded, leaf, row,
                    hasFocus);

    UrlTreeNode node = (UrlTreeNode)(((DefaultMutableTreeNode)value).getUserObject());
    if (node.isMatch()) // Set color
        setForeground(Color.blue);
    else
        setForeground(Color.black);

    if(icon != null) // Set a custom icon
    {
        setOpenIcon(icon);
        setClosedIcon(icon);
        setLeafIcon(icon);
    }

    return this;
}







Summary

This article has shown you how to create a Web spider and a user interface to control it. The user interface employs a JTree to track the spider's progress and record the sites visited. Alternatively, you could use a Vector to record the sites visited and display the progress using a simple counter. Other enhancements could include adding a database interface to record keywords and sites, the ability to search through multiple portals, screening of sites with too much or too little text content, and synonym-search capabilities for the search engine.




The Spider class shown in this article uses recursive calls to a search procedure. Alternatively, a separate thread with a new spider could be launched for each link encountered. This has the benefit of allowing connections to remote URLs to occur concurrently, speeding execution. However, note that the Swing tree classes, notably DefaultMutableTreeNode, are not thread-safe, so programmers must perform their own synchronization.
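If you do take the multithreaded route, one common way (not shown in the demonstration program) to stay safe is to funnel all tree modifications through the Swing event-dispatch thread:

// Sketch only: a worker thread hands the tree update to the event-dispatch thread
// instead of touching the DefaultTreeModel directly.
final DefaultTreeModel model = (DefaultTreeModel)searchTree.getModel();
final DefaultMutableTreeNode parent = treenode;                           // node found by this worker
final DefaultMutableTreeNode child = new DefaultMutableTreeNode(newnode); // wraps a UrlTreeNode
SwingUtilities.invokeLater(new Runnable() {
    public void run()
    {
        model.insertNodeInto(child, parent, parent.getChildCount());
    }
});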






















About the author


Dr. Mark Pendergast is an associate professor at Florida Gulf Coast University. Pendergast received an M.S. and Ph.D. from the University of Arizona and a B.S.E. in electrical computer engineering from the University of Michigan. He has worked as an engineer for Control Data Corporation, Harris Corporation, and Ventana Corporation, and has taught at the University of Florida and Washington State University. His work has appeared in books and journals, and he has presented at numerous conferences. He is a member of the ACM and IEEE computer societies.





