Scraping password protected sites

Using (or providing) Microsoft.NET Classes

Post by neeraj on Fri Jul 24, 2015 6:12 am

How would you do this in Dyalog?

Code:
__author__ = 'ngupta'
from bs4 import BeautifulSoup
import mechanize

LOGIN_URL = "https://www.schwab.com/"
LOGIN_FORM_NAME = "SignonForm"
LOGIN_USER_ID_FIELD = "SignonAccountNumber"
LOGIN_PASSWORD_FIELD = "SignonPassword"
# Create browser
mech_br = mechanize.Browser()
mech_br.set_handle_robots(False)
mech_br.set_handle_refresh(False)
mech_br.addheaders = [('User-agent', 'Firefox')]

user_id="your_id"
password="your_pwd"
mech_br.open(LOGIN_URL)
mech_br.select_form(name=LOGIN_FORM_NAME)
mech_br[LOGIN_USER_ID_FIELD] = user_id
mech_br[LOGIN_PASSWORD_FIELD] = password
login_response = mech_br.submit()

soup = BeautifulSoup(login_response.read(),"html.parser")
table = soup.find("table", {"id": "tblCharlesSchwabBank"})
balance = float(table('tr')[1]('td')[2].span.text[1:])  # 2nd row, 3rd cell
print(balance)
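
One caveat on the last parsing step: the `[1:]` slice assumes the cell text is a single currency symbol followed by plain digits, so a balance displayed as $1,698.53 would make `float()` fail on the comma. A minimal sketch of a more tolerant conversion follows; `parse_balance` is a hypothetical helper, not part of the script above, and it assumes amounts are written with `.` as the decimal separator.

```python
import re


def parse_balance(text):
    """Convert a displayed amount such as '$1,698.53' to a float.

    Hypothetical helper for illustration; assumes '.' is the decimal separator.
    """
    # Drop everything except digits, the decimal point and a minus sign.
    cleaned = re.sub(r"[^0-9.\-]", "", text)
    return float(cleaned)


print(parse_balance("$698.53"))
print(parse_balance("$1,698.53"))
```

In the script, the last step would then pass `table('tr')[1]('td')[2].span.text` to `parse_balance` instead of slicing off the first character.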
neeraj
 
Posts: 78
Joined: Wed Dec 02, 2009 12:10 am
Location: Ithaca, NY, USA

Re: Scraping password protected sites

Post by neeraj on Fri Jul 24, 2015 6:17 am

RUNNING THE SCRIPT:

/System/Library/Frameworks/Python.framework/Versions/2.7/bin/python2.7 "/Users/ngupta/Dropbox/python/pycharm projects/MechanizeTest/Test4.py"
698.53

Process finished with exit code 0

Re: Scraping password protected sites

Post by Vince|Dyalog on Tue Jul 28, 2015 11:14 am

Hi Neeraj,

I would suggest searching the internet for "c# web scrape login" and then translating the C# examples into APL using our .NET interface.

Regards,

Vince
Vince|Dyalog
 
Posts: 280
Joined: Wed Oct 01, 2008 9:39 am

Re: Scraping password protected sites

Post by PGilbert on Tue Jul 28, 2015 3:23 pm

Based on Vince's suggestion and this web page: http://webdata-scraping.com/login-website-programmatically-using-c-web-scraping/ you can do the following with .NET:

Code:
 url←'https://www.schwab.com/'

 ⎕USING←'System.Windows.Forms,System.Windows.Forms.dll'
 ⎕USING,←⊂'System.Drawing,System.Drawing.dll'

 wb←⎕NEW WebBrowser
 wb.Dock←wb.Dock.Fill
 wb.Navigate(⊂url)
 ⎕DL 5 ⍝ Crude fixed wait for the page to load; adjust as needed
 htmlDoc←wb.Document
 html←⎕UCS wb.DocumentStream.ToArray

 signonAcc←htmlDoc.GetElementById(⊂'SignonAccountNumber')
⍝ signonAcc.InnerText←'user_id' ⍝ No error but property is not changed
 signonAcc.InnerHtml←'user_id'

 signonPwd←htmlDoc.GetElementById(⊂'SignonPassword')
⍝ signonPwd.InnerText←'password' ⍝ No error but property is not changed
 signonPwd.InnerHtml←'password'

 loginBtn←htmlDoc.GetElementById(⊂'&lid=Log in')
 loginBtn.InvokeMember(⊂'click')

 ⍝ Show the WebBrowser in a WindowsForm
 fm←⎕NEW Form
 fm.Size←⎕NEW Size(1100,680)
 fm.Text←'URL [ ',url,' ]'
 fm.onClosed←'_GetWebResults_onClosed'
 fm.Controls.Add wb

 fm.Show ⍬

And for the onClosed event function:
Code:
 _GetWebResults_onClosed(sender event)

 (⌷sender.Controls).Dispose


This code runs without errors, but you will have to try it with your own ID and password. 'htmlDoc' is a System.Windows.Forms.HtmlDocument that you can easily interrogate with .GetElementById or .GetElementsByTagName. You find the IDs and tag names by manually inspecting the HTML of the page; in Safari you can also right-click an element of the page and choose 'Inspect Element' from the contextual menu, which shows the HTML of that element and makes its ID easier to find. Sometimes you may need to put ⌷ or ⍬⍴⌷ in front of the result of .GetElementById or .GetElementsByTagName to get it in the proper rank.
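
If you would rather not hunt through the page source by hand, the IDs and names of the form fields can also be listed with a few lines of script. Below is a minimal sketch in Python 3 (the language of the original question), using only the standard library. `FieldLister` is a hypothetical name, and the sample HTML is made up for illustration: it reuses the field names from the first post, but the button ID is invented and the live page's markup must be checked separately.

```python
from html.parser import HTMLParser


class FieldLister(HTMLParser):
    """Collect (tag, id, name) for form-related elements. Hypothetical helper."""

    def __init__(self):
        super().__init__()
        self.fields = []

    def handle_starttag(self, tag, attrs):
        # Record only the elements that matter when filling in a login form.
        if tag in ("form", "input", "button", "select", "textarea"):
            attrs = dict(attrs)
            self.fields.append((tag, attrs.get("id"), attrs.get("name")))


# Made-up sample page; the real markup must be inspected on the live site.
SAMPLE_HTML = """
<form name="SignonForm">
  <input type="text" id="SignonAccountNumber" name="SignonAccountNumber">
  <input type="password" id="SignonPassword" name="SignonPassword">
  <button id="login">Log in</button>
</form>
"""

lister = FieldLister()
lister.feed(SAMPLE_HTML)
for tag, elem_id, name in lister.fields:
    print(tag, elem_id, name)
```

To use it on a real page, you could save the page source to a file and pass its contents to `feed` in place of `SAMPLE_HTML`.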

Good luck.
PGilbert
 
Posts: 361
Joined: Sun Dec 13, 2009 8:46 pm
Location: Montréal, Québec, Canada

Re: Scraping password protected sites

Post by neeraj on Thu Jul 30, 2015 4:09 am

Thanks to both of you. I will try and see how it works out.

