March 6, 2003

I did a little bit of work on the web-update component but didn’t get around to testing it since I got involved in another project :p A friend of mine wanted a data-mining application that would let him select some information on a web page and then have the program identify the different elements in the selection and populate a CSV file with them, so that the data can later be imported into a database. Now, I wasn’t planning to write a customizable data-miner where you can specify how each field is to be derived – sort of like Snipper, if you recall. Rather, I was going to write a custom data-miner which would work only for this particular web page … but I am toying around with the idea of a customizable utility in the back of my mind … just so you know :p

I decided to go with Visual Studio .NET and C# for the app since I’d done all of my previous HTML parsing stuff in Delphi and had always run into one particular problem – being able to parse the HTML in a selection as HTML tags and attributes. I thought this was due to a limitation in Delphi’s support for some of the MS HTML handling interfaces and so decided to go with C#, where MS would have made sure that it supported the latest :p The UI itself came together pretty fast but I still prefer Delphi’s RAD IDE – MS just hasn’t learnt enough from their competition. While the Visual Studio .NET IDE is pretty good and makes UI development much faster, it doesn’t have (or at least *I* couldn’t find …) some of the elements that make UI development such a pleasure with Delphi – simple things like being able to link the menu component to the statusbar component so that menu hints are displayed on the statusbar … little things like that. Plus, the toolbar stuff you get with C# (the Windows Forms ToolBar control) is abysmal – you can’t write event code for each toolbar button, you just have to figure out which button was clicked from a generic toolbar button-clicked event and then carry out the specific action for that particular button. I hated that … but I digress …

I managed to get the UI going pretty fast and then came the actual data extraction stuff. The UI had a web browser component where the user selected the data that they wanted to extract. Identifying the selected portion of the web browser view was simplicity itself since I’d done it many times before – all I had to do was get an IHTMLDocument2 interface for the browser component and then read the selection property of that interface to get an IHTMLSelectionObject interface. Now came the tough part: actually parsing the stuff. I’d assumed before (not so sure that’s how it should be done now though …) that if I was able to create a control range using the createRange method for the selection, I’d be able to get a list which would have the HTML tags and their attributes neatly separated. This was what I’d tried to do in Delphi a couple of times before, and I always ended up failing. I thought it would work in C# but what do you know? I failed again :p So, I’m beginning to think that maybe it’s my approach which is flawed.

Anyway, since that approach didn’t work and a few others I tried didn’t either, I decided to go back to manually parsing the selected HTML. I did find something that I had not known before – that the IHTMLTxtRange interface created by createRange has both an HTMLText property and a Text property. I’d always used the Text property before but this time I used the HTMLText property, since I had found that most of the information I needed was in table cells – so all I had to do was isolate the table cell information and work with that. I wrote a new function to isolate just the text in table cells and then the rest was just a matter of coding … It all works fine now – at least I think so, but I’ll know for sure once my friend has had a go at the final app …
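
For anyone curious what "isolate just the text in table cells" can look like, here is a minimal sketch. It is not the code from the app (that was C# working on the HTMLText string); it is an illustration in Python using only the standard library, and the names CellTextExtractor and cells_to_csv are made up for this example. It assumes the selected HTML arrives as a plain string, treats each table row as one CSV row, and does not try to handle nested tables.

```python
import csv
import io
from html.parser import HTMLParser


class CellTextExtractor(HTMLParser):
    """Collects the text of each <td>/<th> cell, grouped by <tr> row.

    Hypothetical helper for illustration; nested tables are not handled.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <TD> and <td> look the same.
        if tag == "tr":
            self._close_row()
            self._row = []
        elif tag in ("td", "th"):
            self._close_cell()
            if self._row is None:   # a selection may start mid-row
                self._row = []
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag == "tr":
            self._close_row()

    def handle_data(self, data):
        # Text outside of a table cell is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def close(self):
        super().close()
        self._close_row()   # flush a row left open by a partial selection

    def _close_cell(self):
        if self._cell is not None:
            # Collapse runs of whitespace so each cell is one clean value.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None

    def _close_row(self):
        self._close_cell()
        if self._row:
            self.rows.append(self._row)
        self._row = None


def cells_to_csv(html_fragment):
    """Return CSV text with one line per table row found in the fragment."""
    parser = CellTextExtractor()
    parser.feed(html_fragment)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()
```

Calling cells_to_csv on a fragment such as '<TR><TD>Widget, large</TD><TD>9.99</TD></TR>' gives one CSV line with the first field quoted because of the comma. Any markup inside a cell (bold tags, links and so on) is dropped and only the text is kept, which is roughly the "just the text in table cells" idea described above.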

Posted by Fahim at 6:55 am