HTML Parsing (MMF only)
Author: Keatontech!
Submitted: 9th April, 2006
Well, I've just noticed I'm about 6 DC points short of 100, which I think (I hope) is when you can change your rating. I think we'll be seeing a lot more articles, downloads, and reviews now that we can see our DC points.
Anyway, you all probably know that I submitted TDC Downloads Downloader and the HTML parsing engine (which is just the TDC Downloads Downloader source). Well, I've made many more mini-apps that I have not submitted, as they do things like get my homework from my teacher's mini-site. Along the way I have kinda worked out a system, so I will guide you through the steps to make your own custom app using my HTML Parsing Engine.
Ok, so let's get started. As I'm sure you all know, you can view the HTML of a page by going to View > Page Source in your browser. View the source of a page that has things (like news) that you want to have in your app. Look through the source until you find the place where the text you want is. Look at all the lines that have the text and see if you can find some HTML code that is in all of those lines. When you have code that is somewhere in all of those lines, search for it in the rest of the HTML using Edit > Find. In Firefox you can click Highlight All; in other browsers you may have to keep clicking Find Next. If the search finds the code on lines that don't have the text you want, you'll have to find other code that is only in the lines that you want. If it doesn't, move on. Look at one of the lines with the text you want and think of an equation that would parse the text out of it. This is the token you will use later.
Open the HTML Parser Engine in MMF. You'll see a lot of objects. Here's what they do:
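If it helps to see the idea outside of MMF, here is a minimal Python sketch of the same two checks: does a piece of HTML code appear only on the lines you want, and what "equation" pulls the text out of one of those lines. The sample page, the marker strings, and the helper names (lines_containing, text_between) are all made up for illustration; they are not part of the engine.

```python
from typing import List, Optional


def lines_containing(html: str, marker: str) -> List[int]:
    """Return the 1-based numbers of every line that contains marker."""
    return [
        number
        for number, line in enumerate(html.splitlines(), start=1)
        if marker in line
    ]


def text_between(line: str, start: str, end: str) -> Optional[str]:
    """Return the text between the first start and the next end, or None."""
    begin = line.find(start)
    if begin == -1:
        return None
    begin += len(start)
    stop = line.find(end, begin)
    if stop == -1:
        return None
    return line[begin:stop]


# Hypothetical page source; only lines 4 and 5 hold the text we want.
SAMPLE_HTML = """<html>
<body>
<h1>News</h1>
<p class="news">First headline</p>
<p class="news">Second headline</p>
<p>Footer text</p>
</body>
</html>"""

# A good marker: it is on every wanted line and nowhere else.
wanted = lines_containing(SAMPLE_HTML, '<p class="news">')

# A bad marker: it also matches the footer line, so keep looking.
too_broad = lines_containing(SAMPLE_HTML, "<p")

# The "equation": take whatever sits between the opening and closing code.
headline = text_between(
    '<p class="news">First headline</p>', '<p class="news">', "</p>"
)
```

Comparing wanted with too_broad is the same test as clicking Find Next and watching whether the browser stops on a line you don't care about.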
TextExtractHTML:
Loads the HTML file.
TextExtractLines:
Stores the line numbers in the HTML that have the text you want on them.
List 3:
Oops, I forgot to name this one. List 3 stores and saves the extracted text.
TextExtractLinks:
This one stores links that have been extracted from the HTML. This list can be deleted if you don't need two things extracted from the HTML.
InfoHTML:
This one stores the second HTML file. This list can be deleted if you don't need multiple pages scanned.
TextEditRTF:
This RTF object is used to search each line of the list object for the code that appears on the lines you want.
InfoStructureRTF:
This scans the second page of HTML. You can delete this object if you are only scanning one page of HTML.
Progress:
This simply displays what the program is doing.
InfoStructureDloadLine:
This stores the line of the text on the second page. You can delete this if you only have one page to scan, or if your second page has more than one line with text on it.
I think that's all of them. For your first parser program you should probably only have one page to scan with one thing to extract, so delete TextExtractLinks, InfoHTML, InfoStructureRTF, and InfoStructureDloadLine.
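To show when the optional objects earn their keep, here is a rough Python sketch of a two-page scan: the first page gives a name and a link per line, and each linked page is then searched for one line of extra info. Everything in it is an assumption for illustration: the PAGES dictionary stands in for real downloads, and the page names, HTML, and function names are invented. The comments only point at which MMF object plays a similar role.

```python
from typing import List, Optional, Tuple

# Hypothetical pages standing in for real downloads.
PAGES = {
    "index.html": (
        "<ul>\n"
        '<li><a class="dl" href="game1.html">Game One</a></li>\n'
        '<li><a class="dl" href="game2.html">Game Two</a></li>\n'
        "</ul>"
    ),
    "game1.html": '<h1>Game One</h1>\n<span id="file">game1.zip</span>',
    "game2.html": '<h1>Game Two</h1>\n<span id="file">game2.zip</span>',
}


def between(line: str, start: str, end: str) -> Optional[str]:
    """Return the text between the first start and the next end, or None."""
    begin = line.find(start)
    if begin == -1:
        return None
    begin += len(start)
    stop = line.find(end, begin)
    return None if stop == -1 else line[begin:stop]


def scan_two_pages(index_name: str) -> List[Tuple[str, Optional[str]]]:
    names = []  # similar role to List 3: the extracted text
    links = []  # similar role to TextExtractLinks: the second extracted thing
    for line in PAGES[index_name].splitlines():  # first page (TextExtractHTML)
        if 'class="dl"' in line:
            name = between(line, '.html">', "</a>")
            link = between(line, 'href="', '"')
            if name is not None and link is not None:
                names.append(name)
                links.append(link)

    results = []
    for name, link in zip(names, links):
        info_lines = PAGES[link].splitlines()  # second page (InfoHTML)
        # Similar role to InfoStructureDloadLine: the one line holding the text.
        hits = [i for i, text in enumerate(info_lines) if 'id="file"' in text]
        file_name = None
        if len(hits) == 1:
            file_name = between(info_lines[hits[0]], 'id="file">', "</span>")
        results.append((name, file_name))
    return results


downloads = scan_two_pages("index.html")
```

If you only need the names from the first page, the links list and the whole second loop disappear, which is roughly what deleting the four optional objects does.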
Go to the Start of Frame line (1). Change the WebGrab object's download link to your HTML file's link. It is simply whatever is displayed in your address bar. You can customise the messages that are displayed by the Progress Edit object if you want. Now select Line 9. Under the TextExtractRTF Rich Edit object, edit the find command so it searches for the text you found that is only in the lines you want. On Line 10, change the GetSelection condition to GetSelection$( TextExtractRTF ) is different than . Now you can go to Line 17. Under the List 3 List Object, edit the add line action and make the text it adds the token you figured out at the beginning. If you need any help figuring the token out, post a line of HTML with the text you want on it below.
Well, I think that's it. If you need any additional help you can post your problem below, DCMail me, or email me at keatontech@Keatontech.com. Ciao.
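For anyone who thinks better in code than in event lines, the order of those steps can be sketched in Python roughly like this. It is not the engine itself, just the same sequence under some assumptions: download stands in for the WebGrab step, the progress callback stands in for the Progress object, and I am assuming the Line 10 condition is meant to test that the search actually selected something. The sample HTML and all names are made up.

```python
from typing import Callable, List, Tuple
from urllib.request import urlopen


def download(url: str) -> str:
    """Fetch a page as text (stand-in for WebGrab). Not called in the demo."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def parse(
    html: str,
    marker: str,
    start: str,
    end: str,
    progress: Callable[[str], None] = print,
) -> Tuple[List[int], List[str]]:
    lines = html.splitlines()
    line_numbers = []  # which lines hold the text we want
    extracted = []     # the text pulled out of those lines

    progress("Scanning HTML...")
    for number, line in enumerate(lines, start=1):
        # Search the line for the marker and read back what was "selected".
        selection = marker if marker in line else ""
        if selection != "":  # assumed meaning of the Line 10 condition
            line_numbers.append(number)

    progress("Extracting text...")
    for number in line_numbers:
        line = lines[number - 1]
        begin = line.find(start)
        if begin == -1:
            continue
        begin += len(start)
        stop = line.find(end, begin)
        if stop == -1:
            continue
        extracted.append(line[begin:stop])  # the "add line" step

    progress("Done.")
    return line_numbers, extracted


# Demo on a hypothetical homework page instead of a real download.
sample = (
    "<h1>Homework</h1>\n"
    '<td class="hw">Read chapter 4</td>\n'
    '<td class="hw">Essay due Friday</td>\n'
    "<p>Contact</p>"
)
messages: List[str] = []
numbers, items = parse(
    sample, 'class="hw"', 'class="hw">', "</td>", progress=messages.append
)
```

To run it against a real page you would pass download(your_url) as the first argument instead of sample.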