Java: Using relative expressions and a scanner to read HTML?

Derek · Mar 2, 2009

I am very unfamiliar with regular expressions. From what I understand, \Q and \E are to a regular expression roughly what a String is to individual characters: everything between \Q and \E is treated as one literal sequence, so a delimiter wrapped in \Q and \E is matched as a single delimiter. With that said, I want to find all the links within a specified web page.
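To check that understanding of \Q and \E, here is a minimal sketch. The class name QuoteDemo and the helper splitLiteral are made up for illustration, and the helper assumes the delimiter itself does not contain \E. It compares splitting on an unquoted "." with splitting on a quoted one, and shows that Pattern.quote produces the same kind of \Q...\E wrapper.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class QuoteDemo {
    // Hypothetical helper: splits text on a delimiter that is taken literally.
    // Assumes literalDelimiter does not itself contain "\E".
    static String[] splitLiteral(String text, String literalDelimiter) {
        // Inside a Java string literal each backslash is doubled: "\\Q" is the regex \Q.
        return text.split("\\Q" + literalDelimiter + "\\E");
    }

    public static void main(String[] args) {
        // Unquoted, "." matches any character, so every character is a
        // delimiter and nothing is left over.
        System.out.println(Arrays.toString("a.b.c".split(".")));        // []

        // Quoted, only a literal dot counts as a delimiter.
        System.out.println(Arrays.toString(splitLiteral("a.b.c", "."))); // [a, b, c]

        // Pattern.quote wraps a string in \Q...\E for you.
        System.out.println(Pattern.quote("</a>"));                      // \Q</a>\E
    }
}
```

Note that \Q...\E also makes characters such as ? literal, so \Q?<a\E only matches the three characters ?<a in a row.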

I'm using a simple HTML file (on my PC) with a few links and text for debugging purposes.

The problem is, I want to ignore everything before the <a tag, and everything after the </a> tag, and I can't seem to express this with regular expressions.

Anyone able to help? I've tried several variations of a regular expression within useDelimiter(). Here is my code:

// Backslashes must be doubled inside a Java string literal.
webpage.useDelimiter("\\Q?<a\\E|\\Q</a>\\E");

while (webpage.hasNext()) {
    currentLine = webpage.next();
    // Skip empty tokens.
    if (!currentLine.equals("")) {
        temp += currentLine;
        println("Line " + i + ": " + currentLine);
        i++;
    }
}

This recent variation is getting me closer than ever; the output is:

Line 1: <html><body>Hello there. Here is a link:<a href="http://www.google.com">Google
Line 2: <a href="http://www.engadget.com">Engadget
Line 3: <a href="http://www.youtube.com">YouTube
Line 4: </body></html>
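The output above can be reproduced with a small self-contained sketch. The HTML string is an assumption reconstructed from the output (the real test file is not shown), and DelimiterDemo and tokens are illustrative names. It suggests that only the </a> half of the delimiter ever matches: \Q?<a\E looks for the literal text ?<a, which does not occur in the page, so the text before each <a stays attached to the token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class DelimiterDemo {
    // Hypothetical helper: returns the non-empty tokens a Scanner yields
    // for the given delimiter regex.
    static List<String> tokens(String html, String delimiterRegex) {
        List<String> out = new ArrayList<>();
        Scanner webpage = new Scanner(html).useDelimiter(delimiterRegex);
        while (webpage.hasNext()) {
            String token = webpage.next();
            if (!token.isEmpty()) {
                out.add(token);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed stand-in for the local test page.
        String html = "<html><body>Hello there. Here is a link:"
                + "<a href=\"http://www.google.com\">Google</a>"
                + "<a href=\"http://www.engadget.com\">Engadget</a>"
                + "<a href=\"http://www.youtube.com\">YouTube</a>"
                + "</body></html>";

        int i = 1;
        for (String token : tokens(html, "\\Q?<a\\E|\\Q</a>\\E")) {
            System.out.println("Line " + i + ": " + token);
            i++;
        }
    }
}
```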

I just need to ignore everything outside of the anchor tag (<a>), and include the final </a>.
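For comparison, here is one possible sketch of that goal using a different technique from useDelimiter(): it asks the Scanner to search for whole anchor elements with findWithinHorizon, so the closing </a> is part of each result. The class name AnchorFinder, the pattern, and the sample HTML are all assumptions for illustration. A regular expression like this only copes with simple, non-nested anchors and is not a general HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

public class AnchorFinder {
    // "<a" followed by a word boundary (so "<abbr>" is skipped), any attributes
    // up to ">", then the shortest run of text ending at the next "</a>".
    private static final Pattern ANCHOR = Pattern.compile(
            "<a\\b[^>]*>.*?</a>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    static List<String> findAnchors(Scanner webpage) {
        List<String> anchors = new ArrayList<>();
        String match;
        // A horizon of 0 searches the rest of the input; null means no further match.
        while ((match = webpage.findWithinHorizon(ANCHOR, 0)) != null) {
            anchors.add(match);
        }
        return anchors;
    }

    public static void main(String[] args) {
        // Assumed stand-in for the local test page.
        String html = "<html><body>Hello there. Here is a link:"
                + "<a href=\"http://www.google.com\">Google</a>"
                + "<a href=\"http://www.engadget.com\">Engadget</a>"
                + "<a href=\"http://www.youtube.com\">YouTube</a>"
                + "</body></html>";

        int i = 1;
        for (String anchor : findAnchors(new Scanner(html))) {
            System.out.println("Line " + i + ": " + anchor);
            i++;
        }
    }
}
```

Searching for the anchors directly avoids having to describe "everything that is not a link" as a delimiter, which is the part that is hard to express.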

Thanks for your responses -- or even for just reading this short novel,
Derek
Yahoo ...'ed my regular expression. It was \Q?<a\E|\Q</a>\E

Whoops, I accidentally named my question wrong! It should read:

Java: Using REGULAR expressions and a scanner to read HTML?
