Getting data with Excel 2013: using FilterXML for web scraping from Alibaba

As a spreadsheet geek, I was thrilled to find out about the new functionality in Excel 2013, most notably its new Webservice and FilterXML functions, which seemed meant to put it back on par with Google Docs’ much-acclaimed ImportXML function. After all, getting the data where you want it (in your spreadsheet) is half the work.

Unfortunately, both GDocs’ ImportXML and Excel’s Webservice and FilterXML functions have their flaws. GDocs’ ImportXML doesn’t fully support XPath (1.0) — it doesn’t let you pick out a second div using /div[2], instead returning a block of everything matching your query, which makes it hard to get only the parts you want exactly where you want them. Worse yet, at the time of writing you can only use such functions 50 times per sheet, and there is no native way to import a web page into one cell and then parse its contents from other cells. (Note: neither GDocs nor Excel supports scraping from websites requiring authentication, for which you’re better off with the Ruby gem Watir.) As such, I turned towards Excel instead.
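To illustrate the /div[2] complaint: a full XPath engine can address the second sibling directly, which is exactly what ImportXML won’t let you do. A minimal sketch using Python’s standard library as a stand-in (since the spreadsheet formulas can’t run outside their apps):

```python
import xml.etree.ElementTree as ET

# Two sibling <div>s; a positional predicate picks out exactly one of them.
doc = ET.fromstring("<root><div>first</div><div>second</div></root>")
print(doc.find("div[1]").text)  # first
print(doc.find("div[2]").text)  # second
```

This is the behaviour FilterXML gives you too, once you get a well-formed page into a cell.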

Challenges in Excel included the following:

  • None of the existing ‘import data’ features seemed scalable for targeted scraping.
  • Excel currently limits cells to 32,767 characters. That’d seem like a lot, but the web pages I wanted to store in them ran closer to 77k characters. If you try anyway, you’ll be welcomed by a puzzling #VALUE! error.
  • Unlike GDocs’ ImportXML function, Excel’s FilterXML function is pretty picky. Does your web page contain an XML error somewhere? Enjoy your nondescript #VALUE! error; good luck figuring out why you got it!
  • For me, using attribute names in XPath queries (e.g. /div[@class='stuff']) did not work on the actual pages I wanted to scrape. I’m not sure what caused this and would be glad to hear others’ experiences.
  • Worse yet, other blogs covering these new functions only seemed to go over them casually, without mentioning these issues or potential workarounds.
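The FilterXML pickiness comes down to strict XML parsing: almost no real-world HTML page is well-formed XML, so a strict parser rejects it wholesale. A quick illustration, with Python’s strict XML parser standing in for FilterXML’s:

```python
import xml.etree.ElementTree as ET

html = "<div><br><p>unclosed paragraph</div>"  # typical real-world HTML
try:
    ET.fromstring(html)
except ET.ParseError as err:
    # A strict XML parser rejects this outright -- FilterXML's #VALUE!,
    # except with a readable message telling you where it choked.
    print("parse error:", err)
```

This is why the fixXML function below exists: it patches up tag mismatches until a strict parser stops complaining.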

The Excel VBA functions I used to overcome these issues follow; parts and bits were assembled from around the internet. The showXML/fixXML functions are mine; the others borrow heavily from code snippets on StackOverflow.

Function importXML(url As String, XPath As String) As String
    importXML = WorksheetFunction.FilterXML(showXML(url), XPath)
End Function

Function showXML(url As String, Optional trunc As Boolean = False) As String
    Dim s As String
    s = ShowHTML(url)
    'clean up unnecessary stuff to avoid exceeding the 32k character limit
    s = regReplace("<!--(.)*?-->", s, "") 'strip HTML comments
    s = regReplace("<!DOCTYPE([^>]*?)>", s, "") 'strip the doctype declaration
    s = regReplace("[a-zA-Z0-9_\-]+=['""]{(.)*?}['""]", s, "")
    s = regReplace("[\s]+", s, " ")
    s = regReplace("\s*([<>=])\s*", s, "$1")
    s = regReplace("<script(.)*?/script>", s, "")
    s = regReplace("<meta([^>]*?/>|(.)*?</meta>)", s, "")
    s = regReplace("<link([^>]*?/>|(.)*?</link>)", s, "")
    s = regReplace("<style([^>]*?/>|(.)*?</style>)", s, "")
    s = regReplace("<noscript([^>]*?/>|(.)*?</noscript>)", s, "")
    s = regReplace("<header([^>]*?/>|(.)*?</header>)", s, "")
    s = regReplace("<footer([^>]*?/>|(.)*?</footer>)", s, "")
    s = regReplace(" style\s*=\s*(""(.)*?""|'(.)*?')", s, "")
    s = regReplace(" title\s*=\s*(""(.)*?""|'(.)*?')", s, "")
    s = regReplace(" alt\s*=\s*(""(.)*?""|'(.)*?')", s, "")
    s = regReplace(" rel\s*=\s*(""(.)*?""|'(.)*?')", s, "")
    s = regReplace(" target\s*=\s*(""(.)*?""|'(.)*?')", s, "")
    s = regReplace(" type\s*=\s*(""(.)*?""|'(.)*?')", s, "")
    s = regReplace(" data-[a-zA-Z_\-]*?\s*=\s*(""(.)*?""|'(.)*?')", s, "")
    s = Replace(s, "&", "&amp;") 'escape ampersands for XML (naive: double-escapes existing entities)
    s = fixXML(s)
    If trunc Then s = truncate(s)
    showXML = s
End Function
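The same cleanup idea, sketched outside Excel for readers who want to test the patterns interactively. This is a simplified Python illustration, not a port of showXML: shrink_html is a hypothetical name, and the patterns are condensed versions of the ones above.

```python
import re

def shrink_html(s):
    """Strip bulky blocks and attributes so a page has a chance of
    fitting under Excel's 32,767-character cell limit.
    (Hypothetical sketch; patterns simplified from showXML above.)"""
    for pat in (r'<!--.*?-->',            # HTML comments
                r'<script.*?</script>',   # script blocks
                r'<style.*?</style>'):    # style blocks
        s = re.sub(pat, '', s, flags=re.S | re.I)
    # bulky attributes: style/title/alt/rel/target/type/data-*
    s = re.sub(r'\s+(?:style|title|alt|rel|target|type|data-[\w-]*)'
               r'\s*=\s*("[^"]*"|\'[^\']*\')', '', s)
    s = re.sub(r'\s+', ' ', s)            # collapse whitespace runs
    return s

print(shrink_html('<div  style="color:red"><script>var x=1;</script>Hello</div>'))
# -> <div>Hello</div>
```

On the ~77k-character Alibaba pages, this kind of stripping is what gets the payload back under the cell limit before truncate has to lop anything meaningful off.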

Public Function ShowHTML(strURL As String, Optional trunc As Boolean = False) As String
    Dim oXMLHTTP As New MSXML2.XMLHTTP
    With oXMLHTTP
        .Open "GET", strURL, False
        .send ""
        If .Status = 200 Then
            ShowHTML = .responseText
        Else
            ShowHTML = .statusText
        End If
    End With
    ShowHTML = Trim(ShowHTML)
    Set oXMLHTTP = Nothing
    If trunc Then ShowHTML = truncate(ShowHTML)
End Function

Public Function fixXML(s As String) As String 'fix malformed XML
    Dim i As Long, openClose As Long, spacePos As Long
    Dim kind As String, deepestOcc As String
    Dim layer As Integer
    Dim tagStack As New Scripting.Dictionary 'hierarchy stack: layer number -> tag name
    i = 0
    layer = 0
    Do
        i = InStr(i + 1, s, "<") 'start of tag
        If i < 1 Then Exit Do 'no more tags, all done
        openClose = InStr(i, s, ">") 'end of tag
        If (Mid(s, i + 1, 1) = "/") Then 'closing tag
            kind = Trim(Mid(s, i + 2, openClose - i - 2)) 'tag type
            deepestOcc = GetKey(tagStack, kind, True)
            If (deepestOcc = "") Then 'no matching open tag on the stack
                Debug.Print (String(layer, " ") & "-" & kind & " (removed)")
                s = Left(s, i - 1) & Mid(s, openClose + 1) 'remove superfluous closing tag
                i = i - 1
            Else '(assumes forgotten close -- dangerous assumption)
                Do While (layer <> CInt(deepestOcc))
                    s = Left(s, i - 1) & "</" & tagStack.Item(layer) & ">" & Mid(s, i) 'append forgotten closes
                    i = i + Len(tagStack.Item(layer)) + 3
                    Debug.Print (String(layer, " ") & "-" & tagStack.Item(layer) & " (forced)")
                    tagStack.Remove (layer) 'remove tag from hierarchy stack
                    layer = layer - 1
                Loop
                Debug.Print (String(layer, " ") & "-" & kind)
                tagStack.Remove (layer) 'remove tag from hierarchy stack
                layer = layer - 1
            End If
        Else 'opening tag
            If InStr(i, s, "/>") > 0 And InStr(i, s, "/>") < openClose Then GoTo nextIteration 'empty tags don't need closing
            spacePos = InStr(i, s, " ")
            If spacePos < 1 Then spacePos = 65000 'should exceed document size
            kind = Trim(Mid(s, i + 1, WorksheetFunction.Min(spacePos, openClose - 1) - i)) 'tag type
            layer = layer + 1
            Debug.Print (String(layer, " ") & "+" & Mid(s, i + 1, openClose - i - 1))
            Call tagStack.Add(layer, kind)
        End If
nextIteration:
    Loop
    Do While (layer > 0) 'close remaining tags on stack
        Debug.Print (String(layer, " ") & "-" & tagStack.Item(layer) & " (forced cleanup)")
        s = s & "</" & tagStack.Item(layer) & ">" 'append forgotten closes
        tagStack.Remove (layer) 'remove tag from hierarchy stack
        layer = layer - 1
    Loop
    fixXML = s
End Function
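The balancing logic in fixXML — drop close tags that match nothing on the stack, force-close anything left open — can be sketched more compactly. A simplified illustration of the same idea, not a port of the VBA: fix_tags is a hypothetical name, and it ignores attributes, comments, and other edge cases.

```python
import re

def fix_tags(s):
    """Drop unmatched close tags; append closes for tags left open.
    (Hypothetical simplified sketch of the fixXML idea above.)"""
    out, stack = [], []
    # split the input into bare tags and the text between them
    for tok in re.split(r'(</?[a-zA-Z][a-zA-Z0-9]*\s*/?>)', s):
        m = re.match(r'<(/?)([a-zA-Z][a-zA-Z0-9]*)\s*(/?)>', tok)
        if not m:
            out.append(tok)                    # plain text, keep as-is
            continue
        closing, name, selfclose = m.groups()
        if selfclose:
            out.append(tok)                    # <br/> needs no close
        elif closing:
            if name in stack:
                while stack[-1] != name:       # force-close forgotten opens
                    out.append('</%s>' % stack.pop())
                stack.pop()
                out.append(tok)
            # else: superfluous close tag, drop it
        else:
            stack.append(name)                 # open tag, push on the stack
            out.append(tok)
    while stack:                               # close whatever is still open
        out.append('</%s>' % stack.pop())
    return ''.join(out)

print(fix_tags('<div><p>text</span></div>'))  # <div><p>text</p></div>
```

Same caveat as the VBA version: treating a mismatch as a forgotten close is a guess, and a wrong guess can reshuffle the document tree.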

Public Function truncate(ByRef s As String) As String
    Const limit As Long = 32767 'Excel's per-cell character limit
    If (Len(s) > limit) Then
        MsgBox Len(s) & " > " & limit & ", truncating!"
        s = Left(s, limit)
    End If
    truncate = s
End Function

Public Function regReplace(patt As String, inText As String, withText As String) As String
    Dim regEx As Object
    Set regEx = CreateObject("vbscript.regexp")
    With regEx
        .MultiLine = True
        .Pattern = patt
        .Global = True
    End With
    regReplace = regEx.Replace(inText, withText)
End Function

Public Function getMatches(s As String, reg As String) As Variant
    Dim regEx As New RegExp
    regEx.Pattern = reg
    regEx.IgnoreCase = False
    regEx.Global = True
    Set getMatches = regEx.Execute(s)
End Function

Function GetKey(dic As Scripting.Dictionary, strItem As String, Optional last As Boolean = False) As String
    Dim key As Variant
    GetKey = ""
    For Each key In dic.Keys()
        If dic.Item(key) = strItem Then
            GetKey = key
            If (last = False) Then Exit Function 'keys iterate in insertion order, so keep scanning to find the last match
        End If
    Next
End Function

Usage guide: use the above showXML function to import the source of a web page where the Webservice function fails (setting the optional trunc parameter to True where necessary to dodge the 32k character limit), then pick out the good snippets from that using FilterXML. My sample spreadsheet can be found here.
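Concretely, in the sheet that looks like the following (cell references and URL are just illustrative):

```
A1:  http://example.com/page.html
B1:  =showXML(A1, TRUE)           'fetch + clean the page, truncating past the 32k limit
C1:  =FILTERXML(B1, "//title")    'then pick snippets out of B1 with XPath
```

Keeping the cleaned page in its own cell (B1) means you can run as many FILTERXML queries against it as you like without re-fetching the page each time.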

The result? I managed to use as an input only one Alibaba product listing URL for a page like this:

[Screenshot 2: the Alibaba product listing page]

… and use it to scrape all product details for all its listed products from their individual pages, like this one:

[Screenshot 3: the scraped product details in the spreadsheet]

I guess the next step for me would be to write a VBA function to automate scraping a bunch of links this way, then dumping the contents into one big table. So far the only real issue has been the clunky XPath though. Has anyone else figured out how to fix the errors for attribute references (which force me to use ten /div’s instead of the more desirable "//a[@class='mc-click-target']/@href")?


8 Responses to Getting data with Excel 2013: using FilterXML for web scraping from Alibaba

  1. nate says:

    I have a job for you! need products from 10 sites this week…

  2. Bugs says:

    I tried to download a copy of your examples, but they are gone. Have you moved on to something else, or do you have a better idea of how to use FilterXML?

    • tycho01 says:

      Hi Bugs,
      I haven’t really taken the time to blog anymore, but I have tried more scraping since then. In retrospect, the XPath / DOM parsing I tried at the time was pretty horrible, though with GDocs in mind it seemed like the logical choice. I’ve since switched from extracting content with FilterXML to extracting directly with regular expressions; the posted ShowHTML and getMatches functions are still pretty much the core of what I use.
      Aside from those though, Power Query can also be particularly useful, especially when extracting data from HTML tables for many similar pages.
      Would those help you get started? What’s the use case you’re trying to solve?

  3. Al says:

    How can I fix the compile error “Loop without Do” in the fixXML function?

    • tycho01 says:

      Hi Al,
      I’m not getting this (just retested) — did you change anything about the function? The error implies the do-loop structures in the function don’t add up (a loop without a do). They are balanced though, with two of each, so this shouldn’t happen. Would you mind elaborating?

      • Al says:

        Yes, I also think the do-loops are matched, but when it runs, the said error pops up and also highlights the second ‘loop’ – the one close to the end of the function. I copied and pasted many times to make sure there was no error in transferring your code into my Excel VBA module. I have tested on Excel 2010 and 2013. May I know what version of Excel you are doing the retesting on? I have also fiddled with the If and End If, but that does not help.

        I don’t think it should be related to my data, but you can try it with the following string as url:
        http://clinicaltrials.gov/show/NCT00079027?displayxml=true
        and then do
        =importXML(url,"//secondary_outcome")

  4. tycho01 says:

    Hi Al, the version I’m using is 2013. I’m not getting that error, though it’s also not working properly — the instability of this method is also the reason I’ve since abandoned it for regex parsing and Power Query.
    Considering this is pure XML, I’m thinking Power Query may be a great option as well. Aside from fetching web pages, it should also allow you to parse XML structures. I’m not sure if you’re familiar with it, and it’s not exactly the same as ImportXML, but I think it should do what you want. Would that help you in the right direction?
