{"id":5065,"date":"2012-11-21T11:30:18","date_gmt":"2012-11-21T00:30:18","guid":{"rendered":"http:\/\/mingersoft.com\/blog\/?p=5065"},"modified":"2012-11-20T20:55:17","modified_gmt":"2012-11-20T09:55:17","slug":"scraping-html-tables-into-sql-server-using-ssis","status":"publish","type":"post","link":"https:\/\/mingersoft.com\/blog\/2012\/11\/scraping-html-tables-into-sql-server-using-ssis\/","title":{"rendered":"Scraping HTML Tables into SQL Server Using SSIS"},"content":{"rendered":"<p>At work, I am attempting to scrape a table off an intranet webpage so that I can import its contents into Microsoft SQL Server so I can obtain more frequent updates for a small but critical subset of data that I would normally only receive once a day. Using this information I would also be looking to provide some other useful statistics but without this base information the rest is moot at the time being.<\/p>\n<p>There are some distinct pieces to the puzzle here, specifically:<\/p>\n<ol>\n<li>downloading a copy of the webpage,<\/li>\n<li>correcting any errors in the underlying HTML,<\/li>\n<li>parsing the HTML into XML,<\/li>\n<li>importing the XML into SQL Server using SSIS.<\/li>\n<\/ol>\n<p>The final step is fairly easy as SSIS provides native functionality to import XML into a database whilst the first step is really simple as I have an existing script that does the trick. The sticking points are the second and third pieces.<\/p>\n<p>I have found a nifty utility called <a title=\"HTML Tidy\" href=\"http:\/\/tidy.sourceforge.net\/\" target=\"_blank\">HTML Tidy<\/a>\u00a0that takes care of fixing up badly formed HTML by adding in missing tags. This can be invoked on a command line with the output overwriting the original file or to a new file. I chose to write a batch script with the specific command that I want to run and then had that batch file called by a process task in the SSIS control flow.<\/p>\n<p>The final piece in the puzzle is parsing HTML into XML. I&#8217;ve found a script online that looks to be on the right track but requires a little work to do what I want it to do. I&#8217;m not a great C# programmer by any means (in fact, I don&#8217;t claim to know how to program in C#) but I am sure I can fudge something to do the trick.<\/p>\n<p>This is one of those nice to do things so it&#8217;s not high on my list of priorities but I&#8217;ll provide an update when I get some more time to solve this problem.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>At work, I am attempting to scrape a table off an intranet webpage so that I can import its contents into Microsoft SQL Server so I can obtain more frequent updates for a small but critical subset of data that I would normally only receive once a day. Using this information I would also be &hellip; <\/p>\n<p><a class=\"more-link btn\" href=\"https:\/\/mingersoft.com\/blog\/2012\/11\/scraping-html-tables-into-sql-server-using-ssis\/\">Continue reading<\/a><\/p>\n","protected":false},"author":1,"featured_media":4214,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[18],"tags":[1081,2246,1919,1967,201],"class_list":["post-5065","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-technology","tag-html","tag-microsoft","tag-sql-server","tag-ssis","tag-xml","item-wrap"],"jetpack_featured_media_url":"https:\/\/i0.wp.com\/mingersoft.com\/blog\/wp-content\/uploads\/2012\/05\/SQL-Server-Logo.png?fit=247%2C267&ssl=1","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/mingersoft.com\/blog\/wp-json\/wp\/v2\/posts\/5065","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/mingersoft.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mingersoft.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mingersoft.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/mingersoft.com\/blog\/wp-json\/wp\/v2\/comments?post=5065"}],"version-history":[{"count":0,"href":"https:\/\/mingersoft.com\/blog\/wp-json\/wp\/v2\/posts\/5065\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/mingersoft.com\/blog\/wp-json\/wp\/v2\/media\/4214"}],"wp:attachment":[{"href":"https:\/\/mingersoft.com\/blog\/wp-json\/wp\/v2\/media?parent=5065"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mingersoft.com\/blog\/wp-json\/wp\/v2\/categories?post=5065"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mingersoft.com\/blog\/wp-json\/wp\/v2\/tags?post=5065"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}