At work, I am attempting to scrape a table from an intranet webpage and import its contents into Microsoft SQL Server, so that I can get more frequent updates for a small but critical subset of data that I would normally receive only once a day. With this information I would also like to produce some other useful statistics, but without the base data the rest is moot for the time being.
There are a few distinct pieces to this puzzle, specifically:
- downloading a copy of the webpage,
- correcting any errors in the underlying HTML,
- parsing the HTML into XML,
- importing the XML into SQL Server using SSIS.
The final step is fairly easy, as SSIS provides native functionality for importing XML into a database, and the first step is straightforward because I already have a script that does the trick. The sticking points are the second and third pieces.
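My existing download script isn't shown here, but for context this is roughly what that first step amounts to. This is a minimal sketch, assuming the page can be fetched with a plain HTTP GET; the URL and filename would be placeholders for the real intranet address.

```python
from urllib.request import urlopen

def download_page(url, dest):
    # Fetch the raw HTML and write it to disk unchanged (as bytes),
    # ready for HTML Tidy to clean up in the next step.
    with urlopen(url) as response:
        data = response.read()
    with open(dest, "wb") as f:
        f.write(data)
    return dest

# Hypothetical usage:
# download_page("http://intranet/reports/table.html", "page.html")
```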
I have found a nifty utility called HTML Tidy that takes care of fixing up badly formed HTML by adding in missing tags. It can be invoked from the command line, with the output either overwriting the original file or written to a new one. I chose to write a batch script containing the specific command I want to run, and then had that batch file called by an Execute Process task in the SSIS control flow.
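For illustration, the same Tidy invocation could also be wrapped in a little Python instead of a batch file; this is just a sketch, assuming `tidy` is on the PATH and with placeholder filenames. Tidy's `-asxml` flag emits well-formed XHTML, `-q` suppresses non-error chatter, and `-o` names the output file.

```python
import subprocess

def tidy_command(src, dst):
    # Build the Tidy command line: -q quiet, -asxml force XHTML output,
    # -o write to a new file rather than modifying the original.
    return ["tidy", "-q", "-asxml", "-o", dst, src]

def run_tidy(src, dst):
    # Tidy exits 0 (clean) or 1 (warnings only); treat both as success,
    # since warnings are expected on badly formed HTML.
    result = subprocess.run(tidy_command(src, dst))
    return result.returncode in (0, 1)
```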
The final piece in the puzzle is parsing the HTML into XML. I've found a script online that looks to be on the right track but needs a little work to do what I want. I'm not a great C# programmer by any means (in fact, I don't claim to know how to program in C# at all), but I am sure I can fudge something that does the trick.
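I don't have that C# script to show, but here is a rough Python sketch of the same idea for illustration: walk the tidied HTML, collect the text of each table row, and emit the rows as XML elements. The element names (`rows`, `row`, `colN`) are made up and would need to match whatever layout the SSIS XML source expects.

```python
from html.parser import HTMLParser
from xml.etree import ElementTree as ET

class TableParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None  # buffer for the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td":
            self._cell = []

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self.rows[-1].append("".join(self._cell).strip())
            self._cell = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

def table_to_xml(html):
    # Parse the (tidied) HTML, then rebuild each row as XML elements.
    parser = TableParser()
    parser.feed(html)
    root = ET.Element("rows")
    for cells in parser.rows:
        row = ET.SubElement(root, "row")
        for i, text in enumerate(cells):
            ET.SubElement(row, "col%d" % i).text = text
    return ET.tostring(root, encoding="unicode")
```

For example, `table_to_xml("<table><tr><td>a</td><td>b</td></tr></table>")` yields `<rows><row><col0>a</col0><col1>b</col1></row></rows>`.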
This is one of those nice-to-do things, so it's not high on my list of priorities, but I'll post an update when I get some more time to work on it.