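HTML is not actually a subset of XML, but if your markup is well-formed (XHTML, for example), you can use an XML parser to parse the HTML string/document you have. Be aware that a strict XML parser will reject typical real-world HTML with unclosed tags or unquoted attributes.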
There are two common ways to parse a file: with a SAX parser or with a DOM parser. Each takes a different approach. A SAX parser is stream based: it reads the input sequentially and fires events (start of element, end of element, text) as it goes, without keeping the whole document in memory. A DOM parser reads the whole document and builds a tree of it in memory, which you can then navigate and query.
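To make the difference concrete, here is a minimal sketch in Python using only the standard library (`xml.sax` and `xml.dom.minidom`). The sample markup, the `LinkHandler` class and the two helper functions are my own illustrative names, and the sketch assumes the input is well-formed XHTML.

```python
import xml.sax
from xml.dom import minidom

# Hypothetical, well-formed sample input (an assumption for this sketch).
XHTML = (
    "<html><body>"
    "<a href='/one'>One</a><p>text</p><a href='/two'>Two</a>"
    "</body></html>"
)


class LinkHandler(xml.sax.ContentHandler):
    """SAX style: react to events while the parser streams through the input."""

    def __init__(self):
        super().__init__()
        self.links = []

    def startElement(self, name, attrs):
        # Called once for every opening tag the parser encounters.
        if name == "a" and "href" in attrs:
            self.links.append(attrs["href"])


def links_with_sax(markup):
    handler = LinkHandler()
    xml.sax.parseString(markup.encode("utf-8"), handler)
    return handler.links


def links_with_dom(markup):
    # DOM style: build the whole tree first, then query it.
    document = minidom.parseString(markup)
    return [
        a.getAttribute("href")
        for a in document.getElementsByTagName("a")
        if a.hasAttribute("href")
    ]


print(links_with_sax(XHTML))
print(links_with_dom(XHTML))
```

Both functions return the same list of links. The SAX version never holds more than the current event, while the DOM version keeps the full tree around, which is handy if you need to look at the document more than once.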
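The size of your HTML determines which approach fits best: a DOM parser is convenient for small documents, while a SAX parser copes with large ones because its memory use does not grow with the document.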
I'd say go for a SAX parser, as then you don't have to change it when your file grows in size.
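As a rough illustration of why the SAX approach keeps working as the file grows, the sketch below feeds the input to the parser in small chunks instead of loading it all at once. It relies on the incremental `feed()`/`close()` interface of the parser returned by Python's `xml.sax.make_parser()`. `TagCounter`, `count_tags` and the chunk size are hypothetical names and values picked for the example, and the input is again assumed to be well-formed.

```python
import io
import xml.sax


class TagCounter(xml.sax.ContentHandler):
    """Counts how often each element name occurs."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def startElement(self, name, attrs):
        self.counts[name] = self.counts.get(name, 0) + 1


def count_tags(stream, chunk_size=64 * 1024):
    """Parse a binary stream chunk by chunk; only one chunk is read at a time."""
    parser = xml.sax.make_parser()
    handler = TagCounter()
    parser.setContentHandler(handler)
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        parser.feed(chunk)  # events fire as soon as enough input has arrived
    parser.close()  # signals end of input; raises if the document is incomplete
    return handler.counts


# A tiny in-memory stream stands in for a large file opened with open(path, "rb").
sample = io.BytesIO(b"<html><body><p>a</p><p>b</p></body></html>")
print(count_tags(sample, chunk_size=8))
```

The same function works unchanged whether the stream is a few bytes or several gigabytes, since the handler only stores the running counts.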