





Extract links from an HTML page

  

import java.io.FileReader;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML.Attribute;
import javax.swing.text.html.HTML.Tag;
import javax.swing.text.html.HTMLEditorKit.ParserCallback;
import javax.swing.text.html.parser.ParserDelegator;

public class Main {
  public static void main(String[] args) throws Exception {
    final List<String> list = new ArrayList<String>();

    // ParserCallback already provides empty implementations of every handler,
    // so only the start-tag handler needs to be overridden.
    ParserCallback parserCallback = new ParserCallback() {
      @Override
      public void handleStartTag(Tag tag, MutableAttributeSet attribute, int pos) {
        if (tag == Tag.A) {
          // An <a> tag without an href (for example a named anchor) yields null.
          Object address = attribute.getAttribute(Attribute.HREF);
          if (address != null) {
            list.add(address.toString());
          }
        }
      }
    };

    // try-with-resources closes the file once parsing finishes.
    // The last argument is ignoreCharSet; with false, a page that declares a
    // charset in a <meta> tag can make parse() throw ChangedCharSetException.
    try (Reader reader = new FileReader("a.html")) {
      new ParserDelegator().parse(reader, parserCallback, false);
    }
    System.out.println(list);
  }
}
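
The same callback approach also works on HTML that is already in memory, since ParserDelegator.parse accepts any Reader. The following is a minimal sketch under that assumption: the class name LinkExtractor, the method extractLinks, and the sample markup are hypothetical names introduced for illustration only, and ignoreCharSet is set to true here so that a charset declaration in the input does not interrupt parsing.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class; not part of the original example.
final class LinkExtractor {

  // Collects the href value of every <a> start tag read from the given Reader.
  static List<String> extractLinks(Reader reader) throws IOException {
    final List<String> links = new ArrayList<String>();

    HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
      @Override
      public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        if (tag == HTML.Tag.A) {
          Object href = attrs.getAttribute(HTML.Attribute.HREF);
          if (href != null) { // skip anchors that have no href attribute
            links.add(href.toString());
          }
        }
      }
    };

    // true = ignore any charset declared inside the document itself
    new ParserDelegator().parse(reader, callback, true);
    return links;
  }

  public static void main(String[] args) throws IOException {
    // Sample markup made up for this sketch.
    String html = "<html><body>"
        + "<a href=\"https://example.com/\">Home</a>"
        + "<a name=\"top\">Named anchor without href</a>"
        + "<a href=\"/docs/index.html\">Docs</a>"
        + "</body></html>";

    System.out.println(extractLinks(new StringReader(html)));
  }
}
```

Returning the list from a method, rather than printing it inside main, makes the extraction reusable for files, strings, or network streams wrapped in a Reader. Relative links such as /docs/index.html are returned exactly as written in the page.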

   
    
  







