Web Data Extraction with C++ Web Macro
Web data extraction, or web scraping, can be implemented in various ways. Today I will use the Twebst Web Automation Library to extract search results from Google using DOM parsing and Internet Explorer automation (you need to install the Twebst Library first).
Here are the steps that the C++ web macro performs in order to extract results from a Google search:
- Open an Internet Explorer browser and navigate to the Google site.
- Find the search edit box and type the search term into it.
- Find the submit button and click it.
- Wait until the page is loaded and find the DIV with id=res.
- Find the collection of all H3 elements inside the DIV element.
- Extract the text and URL of each result and display them.
Enough talk! Let the code speak for itself.
// pCore is the Twebst core object, created earlier (see the full source in CppMacro.zip).
// Start a new Internet Explorer instance and navigate to a given URL.
IBrowserPtr pBrowser = pCore->StartBrowser("http://www.google.com/");
// Find search edit box in page and type some text into it.
IElementPtr pSearchEdit = pBrowser->FindElement("input text", SearchCondition("name=q"));
pSearchEdit->InputText("codecentrix");
// Find search button and click it.
IElementPtr pSearchBtn = pBrowser->FindElement("input submit", SearchCondition("text=Google Search"));
pSearchBtn->Click();
// Find the DIV element where the results are displayed.
IElementPtr pResultDiv = pBrowser->FindElement("div", SearchCondition("id=res"));
// Get all found results and print them in console.
IElementListPtr pResultList = pResultDiv->FindAllElements("h3", SearchCondition());
// Display only the header result (text and url).
for (int i = 0; i < pResultList->length; ++i)
{
    // Get current H3 in the list.
    IElementPtr pCrntResult = pResultList->Getitem(i);
    // Find the first and only anchor inside the H3.
    IElementPtr pCrntAnchor = pCrntResult->FindElement("a", SearchCondition());
    CComQIPtr<IHTMLAnchorElement> spCrntAnchor = pCrntAnchor->nativeElement;
    if (!spCrntAnchor)
        continue; // The native element is not an anchor; skip this result.
    // Get URL from IHTMLAnchorElement. Start from an empty CComBSTR:
    // operator& expects no string to be attached, otherwise the old one leaks.
    CComBSTR bstrURL;
    spCrntAnchor->get_href(&bstrURL);
    // Display results.
    wcout << pCrntResult->text << L"\n" << bstrURL.m_str << L"\n\n";
}
Download:
- CppMacro.zip (full source code and exe)
- Twebst Web Automation Library
IE Web Login Automation
One highly repetitive web task is logging on to a web site. This is a common scenario where the Twebst Web Automation Library really shines. Here is a short web macro written in JScript that automatically logs you on to the Yahoo Mail site. All you have to do is replace “UUUUUUUUUU” and “PPPPPPPPPP” with your user name and password in the code below.
// Open a browser and navigate to yahoo mail login page.
var core = new ActiveXObject("Twebst.Core");
var browser = core.StartBrowser("https://login.yahoo.com/config/mail?.intl=us");
// Find login fields.
var u = browser.FindElement("input text");
var p = browser.FindElement("input password");
var s = browser.FindElement("input submit");
// Log on to the site by filling in the user-name and password fields and then clicking the submit button.
u.InputText("UUUUUUUUUU");
p.InputText("PPPPPPPPPP");
s.Click();
FindElement searches through the whole frame/iframe hierarchy for the first input element of type text/password/submit. Additional search conditions can be specified (such as searching for an element by id, name or any other HTML attribute). Search conditions can make use of regular expressions if needed.
One more important thing is that the FindElement method waits for the web page to be completely loaded before searching for the element (the timeout can be specified with the core.loadTimeout property). Read more about Twebst Library…
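To make the idea of a search condition concrete, here is a minimal sketch in plain JavaScript of what matching on attributes, optionally with a regular expression, amounts to. It is not the Twebst implementation or its API: the `matches` and `findFirst` helpers and the plain-object "elements" are assumptions made up for this illustration.

```javascript
// Hypothetical model of attribute-based element search; NOT the Twebst API.
// An "element" here is a plain object: { tag: "input", attrs: { name: "q" } }.

// True when the element has the given tag and every condition holds.
// A condition value is either an exact string or a RegExp.
function matches(element, tag, conditions) {
  if (element.tag !== tag) return false;
  return Object.keys(conditions).every(function (name) {
    var expected = conditions[name];
    var actual = element.attrs[name];
    if (actual === undefined) return false;
    return expected instanceof RegExp ? expected.test(actual) : actual === expected;
  });
}

// Return the first matching element in document order, or null.
function findFirst(elements, tag, conditions) {
  for (var i = 0; i < elements.length; i++) {
    if (matches(elements[i], tag, conditions)) return elements[i];
  }
  return null;
}

// A made-up page with three input elements.
var samplePage = [
  { tag: "input", attrs: { type: "hidden", name: "hl" } },
  { tag: "input", attrs: { type: "text", name: "q" } },
  { tag: "input", attrs: { type: "submit", name: "btnG", value: "Google Search" } }
];

var searchEdit = findFirst(samplePage, "input", { name: "q" });
var searchButton = findFirst(samplePage, "input", { type: "submit", value: /^Google/ });
```

Combining several attributes in one condition object is what makes the search precise: the first call picks the edit box by name, the second picks the button by type plus a pattern on its caption.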
Download:
- Twebst Library Pro (it must be installed for the macro to work).
- YahooMail.js
- Script Of The Day system if you want to easily install more login templates.
Twebst Web Automation Library v1.40 released
Twebst version 1.40 is launched!
Main changes include IE8 compatibility, better support for working with the embedded IE browser control, support for modal and modeless HTML dialogs, and functions for clipboard access.
Here is the list of new features and enhancements:
- NEW: IE8 is now supported
- ENH: core.AttachToNative* methods work now with hosted IE browser control
- BUG: various fixes
- NEW: core.foregroundBrowser property
- NEW: core.productName property
- NEW: core.productVersion property
- NEW: core.GetClipboardText method
- NEW: core.SetClipboardText method
- NEW: core.AttachToWnd method
- NEW: core.NativeWindowToNativeBrowser method
- NEW: core.NativeWindowToNativeDocument method
- NEW: browser.FindModalHtmlDialog method
- NEW: browser.FindModelessHtmlDialog method
- NEW: element.GetAttribute method
- NEW: element.SetAttribute method
- NEW: element.RemoveAttribute method
- NEW: element.tagName property
- NEW: element.FindParentElement method
- NEW: core.RightClick method
- Find more …
Homemade Handcrafted Help System
Here is the solution:
- The template is an XML document. When documenting an object/method or property the focus is on the content rather than on formatting the text. There is one XML file for each object/method/property.
- A WSH script written in JScript parses the XML document and adds syntax highlighting to the sample code in the documentation page. Regular expressions are used for parsing.
- Cross references are added automatically by the same script.
- Then an XSL transformation is applied to convert the XML source to an HTML document, which is eventually written to disk.
- The whole process is optimized by removing unnecessary operations, such as generating the HTML when it already exists and is newer than its XML source.
- Finally, the HTML documents reference a CSS style sheet so that the look can be changed easily.
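As an illustration of the syntax highlighting step, here is a minimal sketch of regular-expression based keyword coloring. It is not the code from Build.js: the keyword list, the `kw` CSS class and the helper names are assumptions chosen for the example.

```javascript
// Hypothetical keyword list and CSS class; the real script may use others.
var KEYWORDS = ["var", "function", "return", "if", "else", "for", "while", "new"];
var KEYWORD_RE = new RegExp("\\b(" + KEYWORDS.join("|") + ")\\b", "g");

// Escape the characters that are special in HTML before adding markup.
function escapeHtml(text) {
  return text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Wrap every whole-word keyword in a span that the style sheet can color.
function highlight(code) {
  return escapeHtml(code).replace(KEYWORD_RE, '<span class="kw">$1</span>');
}

var highlighted = highlight("var x = new Foo();");
```

The `\b` word boundaries keep identifiers such as `variable` from being colored. A sketch this small also colors keywords inside string literals and comments, which a fuller highlighter would have to tokenize first.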
It goes like this:
XML + JScript -> XML with color syntax and cross references + XSL -> HTML + CSS -> CHM
For local help, the CHM compiler is invoked as a final step and a CHM help file is generated. All you have to do is launch the Build.js script found in the archive below.
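The "skip what is already up to date" optimization mentioned in the list above boils down to a timestamp comparison. Here is a minimal sketch of that check; the function names and the topic records are hypothetical, and timestamps are plain numbers so that the logic stays independent of any file system API.

```javascript
// Decide whether an HTML page must be regenerated from its XML source.
// sourceTime: last-modified time of the XML file.
// outputTime: last-modified time of the HTML file, or null if it does not exist.
function needsRebuild(sourceTime, outputTime) {
  if (outputTime === null) return true; // never generated
  return outputTime <= sourceTime;      // source changed after the last build
}

// Pick the names of the topics whose HTML is missing or stale.
function selectStale(topics) {
  return topics
    .filter(function (t) { return needsRebuild(t.xmlTime, t.htmlTime); })
    .map(function (t) { return t.name; });
}

// Made-up topics: one up to date, one edited after the build, one never built.
var staleTopics = selectStale([
  { name: "FindElement", xmlTime: 100, htmlTime: 200 },
  { name: "StartBrowser", xmlTime: 300, htmlTime: 200 },
  { name: "loadTimeout", xmlTime: 50, htmlTime: null }
]);
```

Only the topics returned by `selectStale` need to go through the highlighting and XSL steps again, which keeps repeated builds short.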
Downloads: TwebstHelp.zip
Prerequisites: In order to build the CHM file you’ll need HTML Help Workshop from Microsoft.
WSH and clipboard access
I did some Windows Script Host programming recently and I was pleasantly surprised by its power, features and flexibility. One thing that I couldn’t accomplish was accessing the clipboard from WSH. Digging around the internet, I found some solutions, like this one based on Internet Explorer Automation. There are several problems with this approach, as you can read in my article about Internet Explorer Automation: What’s wrong with Internet Explorer Automation?
My solution for scripting the clipboard content in WSH is a regular COM object created with VC++ and ATL.
Download full source code and compiled DLL: WSH_clipboard.zip
To install the COM object, run register.bat.
I found scripting the clipboard useful enough to add this feature to the next release of Twebst Web Automation Library.
Free Web Macros for Internet Explorer
As I presented in my previous post, automating Internet Explorer can be a difficult task.
Twebst Web Automation Library can make things easier.
Get it FREE!
What can Twebst do?
- increase productivity by automating repetitive web tasks
- automate regression testing of web applications
- automate web actions and data-entry
- automatically log in to different web sites
- fill out web-forms automatically
- extract data from web pages (web scraping)
- monitor web pages
Twebst features
- Start new browsers and navigate to a specified URL.
- Connect to existing browsers.
- Search and access HTML elements and frames inside browsers.
- Intuitive names for HTML elements using the text that appears on the screen.
- Advanced search of browsers and HTML elements using regular expressions.
- Perform actions on all HTML controls (button, combo-box, list-box, edit-box etc).
- Simulate user behavior by generating hardware or browser events.
- Get access to native interfaces exposed by Internet Explorer so you don’t need to learn new things if you already know IE web programming.
- Synchronize web actions and navigation by waiting for the page to complete within a specified timeout.
- Available from any programming or scripting language that supports COM.
- Optimized search methods and collections.
What’s wrong with Internet Explorer Automation?
Though the Internet Explorer browser is not part of the Office suite, it supports automation. Here is a short sample:
// Create an IE automation object.
var ie = new ActiveXObject("InternetExplorer.Application");
// Make it visible and navigate to a given URL.
ie.Visible = true;
ie.Navigate("http://www.google.com/");
// Give it some time to load the page and then get the document.
WScript.Sleep(3000);
var doc = ie.Document;
// Fill out search field.
var edit = doc.getElementsByName("q").item(0);
edit.value = "codecentrix";
// ... and press the submit button.
var submit = doc.getElementsByName("btnG").item(0);
submit.click();
Here is ie_auto.js file for download.
However there are problems with Internet Explorer automation:
- it may not work at all on Windows Vista unless the script runs at the same integrity level as the iexplore.exe process. Simply double-clicking the js file won’t do it: the script runs at medium integrity level while Internet Explorer runs at low integrity level, and as a result the script fails. If you run the script at high integrity level, the newly started IE instance gets the same high integrity level and the script works (but this is not the best option from a security point of view). Changing the integrity level of the running script (or application) is not always the most desirable or easiest thing to do.
- no support for “connecting” to already existing IE documents.
- sub-documents in different domains are not accessible for scripting due to cross-domain scripting security restrictions (see my older posts: “When IHTMLWindow2::get_document returns E_ACCESSDENIED” and “When IHTMLWindow2.document throws UnauthorizedAccessException”).
- searching for elements across all sub-documents inside frames/iframes is difficult (and sometimes impossible, see the point above).
- searching for HTML elements on attributes other than id or name is difficult and time consuming (getElementById and getElementsByName are the only methods I know that search elements directly without browsing element collections, which might be very slow when performed out of process).
- no tab support (you can only launch a new tab).
- no direct support for synchronizing input actions (clicks, keys) with the HTML document loading (it could be implemented by registering for IE events like document complete, or by looping until the browser becomes ready to accept input).
- no advanced search criteria like regular expressions or searching on multiple attributes.
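The "looping until the browser becomes ready" workaround from the list above can be sketched as follows. The sketch relies on the Busy and ReadyState properties of the InternetExplorer.Application object (4 is READYSTATE_COMPLETE); the `waitUntilReady` helper and its parameters are hypothetical names for this illustration, and the sleep function is passed in so that the loop is not tied to WSH.

```javascript
var READYSTATE_COMPLETE = 4;

// Poll the browser until it is idle and the document is complete.
// ie:        object exposing Busy and ReadyState, as InternetExplorer.Application does.
// sleep:     blocking sleep function; under WSH: function (ms) { WScript.Sleep(ms); }
// timeoutMs: give up after roughly this many milliseconds of sleeping.
// pollMs:    pause between two checks.
// Returns true when the browser became ready, false on timeout.
function waitUntilReady(ie, sleep, timeoutMs, pollMs) {
  var waited = 0;
  while (ie.Busy || ie.ReadyState !== READYSTATE_COMPLETE) {
    if (waited >= timeoutMs) return false;
    sleep(pollMs);
    waited += pollMs;
  }
  return true;
}

// Intended WSH usage, replacing the fixed WScript.Sleep(3000) in the sample above:
//   if (!waitUntilReady(ie, function (ms) { WScript.Sleep(ms); }, 30000, 100)) {
//     WScript.Echo("Page did not finish loading in time.");
//   }
```

Even with such a loop, the top-level document reporting "complete" says nothing about frames that are still loading, so this only addresses part of the synchronization problem.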
If you are interested in solving the issues above, let me introduce a project I’ve been working on for some time now. Here’s Twebst, a web automation library for Internet Explorer!
Get it FREE!
focus vs fireEvent("onfocus")
While working on the Twebst web automation library I encountered this problem: how do you simulate setting the focus on HTML edit controls in Internet Explorer? There are two ways to do this.
- Call the IHTMLElement2::focus() method on the target element, which "causes the element to receive the focus and executes the code specified by the onfocus event".
- Raise the onfocus event on the target element by calling the IHTMLElement3::fireEvent() method.
The two approaches are quite similar but there are some interesting differences.
- fireEvent("onfocus") does not actually set the focus on the element; it just executes the code of the onfocus event handler.
- Calling the focus method sets the focus on the target element and calls the onfocus event handler, but not immediately. The onfocus event seems to be inserted in a queue and its handler is executed asynchronously after the current handler has finished.
- If the focus method is called from inside the onfocus handler, nothing happens when the control already has the focus (that prevents infinite recursion).
Example:
<html>
<script type="text/javascript" language="javascript">
function BtnFocusClick()
{
document.getElementById('editTest').focus();
window.status += "b";
}
function BtnOnFocusClick()
{
document.getElementById('editTest').fireEvent('onfocus');
window.status += "c";
}
function EditOnFocus()
{
window.status += "a";
}
</script>
<body>
<input type="text" onfocus="EditOnFocus();" id="editTest"/><br/>
<input type="button" value="focus" id="btnFocus" onclick="BtnFocusClick();"/>
<input type="button" value="fire onfocus" id="btnOnFocus" onclick="BtnOnFocusClick();"/>
</body>
</html>
If you press the "fire onfocus" button, the message in the Internet Explorer status bar is the expected one: "ac". If you press the "focus" button, the message is in the reverse of the expected order: "ba". That suggests that the EditOnFocus handler is called after BtnFocusClick exits.
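The observed ordering can be reproduced with a small toy model in plain JavaScript. This is only a sketch of the behavior described above, assuming the onfocus event really is queued; it says nothing about how Internet Explorer is implemented, and `createPage`, `pending` and `click` are made-up names.

```javascript
// Toy model of one page with a single edit box; NOT Internet Explorer internals.
function createPage() {
  var page = { status: "", pending: [], focused: false };

  page.edit = {
    onfocus: function () { page.status += "a"; },

    // Modeled focus(): take the focus now, but only queue the handler.
    focus: function () {
      if (page.focused) return; // already focused: nothing happens
      page.focused = true;
      page.pending.push(page.edit.onfocus);
    },

    // Modeled fireEvent("onfocus"): run the handler at once, focus unchanged.
    fireEvent: function (name) {
      if (name === "onfocus") page.edit.onfocus();
    }
  };

  // Run one click handler to completion, then drain the queued events.
  page.click = function (handler) {
    handler();
    while (page.pending.length > 0) page.pending.shift()();
  };

  return page;
}

// Equivalent of pressing the "focus" button.
var focusPage = createPage();
focusPage.click(function () { focusPage.edit.focus(); focusPage.status += "b"; });

// Equivalent of pressing the "fire onfocus" button.
var firePage = createPage();
firePage.click(function () { firePage.edit.fireEvent("onfocus"); firePage.status += "c"; });
```

In the model, `focusPage.status` ends up as "ba" because the queued handler runs only after the click handler returns, while `firePage.status` ends up as "ac" and `firePage.focused` stays false, matching the two differences listed earlier.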
Repairing Internet Explorer
1). ‘Open in New Window’ Command Does Not Work in Internet Explorer
3). When you open a new tab in Internet Explorer 7, does the “You’ve opened a new tab” message appear every time, and are you unable to turn off this option?
4). Open a link in a new tab takes forever to load.
I’ll come back later with examples of how an extension can break IE and how to avoid it. For now let’s concentrate on how we can fix it. Download the RepairIE.zip file, extract the files inside and run “fixie.cmd”.
Points of interest:
- Internet Explorer components are registered using regsvr32.exe tool.
- For IE7, mshtml.dll cannot be registered as explained above; instead, mshtml.tlb is registered using the regtlib.exe tool.
- Registering shdocvw.dll using regsvr32.exe is not a good idea on IE7 because it damages the [HKEY_CLASSES_ROOT\Typelib\{EAB22AC0-30C1-11CF-A7EB-0000C05BAE0B}\1.1\win32] registry key.
- reg_ieframe.reg file fixes the broken key above.
Allow local scripted HTML files to run in IE7
While running automated tests of the Twebst library, I was very annoyed by IE7 refusing to properly open local HTML files that contain scripts. The following message is displayed:
“To help protect your security, Internet Explorer has restricted this web page from running scripts or ActiveX controls that could access your computer. Click here for options…”
It took me several minutes to find the hidden option that turns off this warning. First I looked for it in Security tab options but it was actually in Advanced tab. Here’s how you find it:
1). Go to:
Tools > Internet Options > Advanced > Security
2). Check:
Allow active content to run in files on My Computer
3). Restart IE