Source Code Example - Scraping a Web Page

Post by neeraj on Thu Apr 17, 2014 4:52 am

The following may be of help to other people:
      TEST;⎕USING;srcUriString;srcUri;client;str
⍝ Make the .NET namespaces needed below visible
⎕USING←'System,mscorlib.dll'
⎕USING,←⊂'System.IO,mscorlib.dll'
⎕USING,←⊂'System.Net,System.dll'
srcUriString←⎕NEW String(⊂'http://www.cayugafamilydental.com')
srcUri←⎕NEW Uri srcUriString
client←⎕NEW WebClient ⍬
str←client.DownloadString srcUri ⍝ page source as a character vector
⍴str ⍝ display its length

Re: Source Code Example - Scraping a Web Page

Post by Dick Bowman on Thu Apr 17, 2014 7:36 am

Any advantage over the example in the .NET Interface Guide?

Always interesting to see different ways to skin a rat. Descriptions of pros and cons can help take what we learn into unknown territory.

Re: Source Code Example - Scraping a Web Page

Post by jGoff on Thu Apr 17, 2014 3:01 pm

Always good to have a simple "scraper" on hand. Using v12.1, it worked the first time after a copy and a paste. (Careful not to let the first ⎕NEW line wrap.) Thanks for sharing.

P.S. Not to mention that if I ever have a toothache in Ithaca, I'll know where to go.

Re: Source Code Example - Scraping a Web Page

Post by neeraj on Thu Apr 17, 2014 6:03 pm

This version is shorter. I tried using the Conga workspace and HTTPGet in the Samples namespace, but it was not returning the desired result, so I used .NET instead, which does return the correct result.

Re: Source Code Example - Scraping a Web Page

Post by PGilbert on Thu Apr 17, 2014 7:24 pm

Thanks for sharing your short version. Here is what we did recently, inspired by Dyalog's .NET Interface Guide:

Code:
 TEST2;dataStream;reader;request;response;responseFromServer;url;⎕USING
 ⎕USING←'System.Net,System.dll' 'System.IO,mscorlib.dll' 'System.Text,mscorlib.dll'
 url←'http://www.cayugafamilydental.com'
 request←WebRequest.Create(⊂url)
 response←request.GetResponse
 dataStream←response.GetResponseStream
 reader←⎕NEW StreamReader(dataStream,Encoding.GetEncoding(⊂response.CharacterSet))
 responseFromServer←reader.ReadToEnd


What do you do to extract the information you are looking for from the HTML? (We ended up transforming the HTML into XHTML and searching it as if it were XML.)

Re: Source Code Example - Scraping a Web Page

Post by DanB|Dyalog on Thu Apr 17, 2014 9:05 pm

neeraj: what did you do to get the page and how did you do it?
I tried:
Code:
      )load conga
C:\Program Files\Dyalog\V14U\ws\conga saved Mon Apr 07 17:20:16 2014
      ⍴¨r←Samples.HTTPGet'www.dyalog.com'
   11 2  17192
      r∊⊂p
0 0 1

'p' is the result from your function. It matches the third element returned by Samples.HTTPGet.
/Dan

Re: Source Code Example - Scraping a Web Page

Post by neeraj on Fri Apr 18, 2014 3:39 am

Samples.HTTPGet 'http://finance.google.com/finance/info?%20client=ig&q=NASDAQ:GOOG,NYSE:IBM'

The above should come back with something like:

// [ { "id": "304466804484872" ,"t" : "GOOG" ,"e" : "NASDAQ" ,"l" : "536.10" ,"l_fix" : "536.10" ,"l_cur" : "536.10" ,"s": "2" ,"ltt":"4:00PM EDT" ,"lt" : "Apr 17, 4:00PM EDT" ,"lt_dts" : "2014-04-17T16:00:00Z" ,"c" : "-20.44" ,"c_fix" : "-20.44" ,"cp" : "-3.67" ,"cp_fix" : "-3.67" ,"ccol" : "chr" ,"pcls_fix" : "556.54" ,"el": "538.16" ,"el_fix": "538.16" ,"el_cur": "538.16" ,"elt" : "Apr 17, 7:59PM EDT" ,"ec" : "+2.06" ,"ec_fix" : "2.06" ,"ecp" : "0.38" ,"ecp_fix" : "0.38" ,"eccol" : "chg" ,"div" : "" ,"yld" : "" } ,{ "id": "18241" ,"t" : "IBM" ,"e" : "NYSE" ,"l" : "190.01" ,"l_fix" : "190.01" ,"l_cur" : "190.01" ,"s": "2" ,"ltt":"4:02PM EDT" ,"lt" : "Apr 17, 4:02PM EDT" ,"lt_dts" : "2014-04-17T16:02:08Z" ,"c" : "-6.39" ,"c_fix" : "-6.39" ,"cp" : "-3.25" ,"cp_fix" : "-3.25" ,"ccol" : "chr" ,"pcls_fix" : "196.4" ,"el": "190.37" ,"el_fix": "190.37" ,"el_cur": "190.37" ,"elt" : "Apr 17, 7:57PM EDT" ,"ec" : "+0.36" ,"ec_fix" : "0.36" ,"ecp" : "0.19" ,"ecp_fix" : "0.19" ,"eccol" : "chg" ,"div" : "0.95" ,"yld" : "2.00" } ]

which is JSON containing a two-element array. I do not get the above result. My original post was a contrived example, chosen to avoid JSON issues when the focus was on HTTPGet.

Re: Source Code Example - Scraping a Web Page

Post by neeraj on Fri Apr 18, 2014 3:50 am

PGilbert:

Here is how I have been dealing with HTML. It is a snippet but will give you a flavor. I am just looking for specific information in the HTML.

      :Case 2
⍝ Schwab A Rated Stocks
C←NFILE∆READ(∆FILEPATH,'SCHWAB\SCHWABA1.WEBARCHIVE')
S←7↓¨2000↑¨(' symbol="'⍷C)⊂C ⍝ All stock names are of the form ' symbol="IBM"'
AGRADE←STK¨S
C←NFILE∆READ(∆FILEPATH,'SCHWAB\SCHWABA2.WEBARCHIVE')
S←20↑¨(' symbol="'⍷C)⊂C
AGRADE←AGRADE,STK¨S
C←NFILE∆READ(∆FILEPATH,'SCHWAB\SCHWABA3.WEBARCHIVE')
S←20↑¨(' symbol="'⍷C)⊂C
AGRADE←fIXNAME¨AGRADE,STK¨S
∆MTX[(∆IN ∆MTX[;3]≡¨⊂'A');3]←⊂'--' ⍝ All previous A Grades are reset to --
I←∆MTX[;2]⍳AGRADE
existing←(I<1↑⍴∆MTX)/I
∆MTX[existing;3]←'A'
∆MTX[NR;3]←⊂date 0
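
The partitioning idiom used in the snippet above, (' symbol="'⍷C)⊂C, can be tried on a small string. This is only a sketch: the html sample and the names marker, parts and syms are made up for illustration, and it assumes the default ⎕IO←1 and ⎕ML←1 (so that dyadic ⊂ is partitioned enclose).

```apl
⍝ Hypothetical sample; the real input would be the text read from the saved page
html←'<td symbol="IBM">x</td><td symbol="GOOG">y</td>'
marker←' symbol="'
⍝ Start a new segment at every position where the marker begins
parts←(marker⍷html)⊂html
⍝ Drop the marker from each segment and keep the text up to the closing quote
syms←{(¯1+⍵⍳'"')↑⍵}¨(⊃⍴marker)↓¨parts
```

With this sample, syms holds the two character vectors IBM and GOOG. Anything before the first marker is discarded by the partitioned enclose, which is why no separate step is needed to skip the start of the page.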

Re: Source Code Example - Scraping a Web Page

Post by Morten|Dyalog on Fri Apr 18, 2014 7:36 am

If you try this with most of the HTTPGet functions that are out there, it will fail for two reasons. First, because the URL is redirected (the www has been removed from the address):

Code:
      Samples.HTTPGet 'http://www.cayugafamilydental.com'
0   http/1.1 301 moved permanently                                                                     
    date                             Thu, 17 Apr 2014 06:19:02 GMT                                     
    server                           Apache                                                           
    x-powered-by                     PHP/5.4.27                                                       
    expires                          Thu, 19 Nov 1981 08:52:00 GMT                                     
    cache-control                    no-store, no-cache, must-revalidate, post-check=0, pre-check=0   
    pragma                           no-cache                                                         
    x-pingback                       http://cayugafamilydental.com/xmlrpc.php                         
    set-cookie                       PHPSESSID=a1eca9c1da2b358b76b2708acf53b07b; path=/               
    location                         http://cayugafamilydental.com/                                   
    content-length                   0                                                                 
    content-type                     text/html; charset=UTF-8                                         

Secondly, if you switch to the correct address, it fails because the content is sent in "chunked" mode, which we did not support. The attached file contains the source for the Samples.HTTPGet function that will be distributed with v14.0; it handles both the redirection and the chunking. The advantage of HTTPGet over the .NET solution is that it is cross-platform: it will work under Windows, AIX, and Linux (including the Raspberry Pi), as well as the future versions of Dyalog APL that we are currently working on (Mac OS and Android; release dates still not set).
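
For readers who want to see what the two failure modes involve, here is a rough sketch in plain APL. It is not the code in the attached file: GetHeader and Dechunk are hypothetical names, the header layout (a 2-column matrix of names and values) is only assumed from the display above, and the body is assumed to be a character vector with one character per byte. ⎕IO←1 is assumed throughout.

```apl
⍝ Look up a header value in a 2-column matrix of names and values
⍝ (assumed layout, modelled on the display above)
GetHeader←{
    i←⍵[;1]⍳⊂,⍺
    i>⊃⍴⍵:''                 ⍝ header not present
    ⊃⍵[i;2]
}

⍝ Decode a body sent with chunked transfer encoding:
⍝ each chunk is <hex size>CRLF<data>CRLF, and a size of 0 ends the body
Dechunk←{
    crlf←⎕UCS 13 10
    Hex←{i←¯1+'0123456789abcdefABCDEF'⍳⍵ ⋄ 16⊥i-6×i>15}
    ''{
        p←¯1+(crlf⍷⍵)⍳1      ⍝ length of the chunk-size line
        line←p↑⍵
        size←Hex(¯1+line⍳';')↑line   ⍝ ignore any chunk extension after ';'
        size=0:⍺             ⍝ last chunk: return what has been collected
        rest←(p+2)↓⍵         ⍝ skip the size line and its CRLF
        (⍺,size↑rest)∇(size+2)↓rest
    }⍵
}
```

A caller could use 'location' GetHeader hdrs to find the new address when the status line reports a 301, and apply Dechunk to the body when the transfer-encoding header says chunked. Trailers and malformed input are not handled here.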
Attachments
HTTPGet.dyalog
New code for CONGA workspace Samples.HTTPGet function
(5.23 KiB)

Re: Source Code Example - Scraping a Web Page

Post by Brian|Dyalog on Fri Apr 18, 2014 1:55 pm

Using the HTTPGet that Morten supplied, you can retrieve the JSON result you want...

      rc hdrs response←Samples.HTTPGet 'http://finance.google.com/finance/info?%20client=ig&q=NASDAQ:GOOG,NYSE:IBM' 

response~⎕ucs 13 10 ⍝ remove carriage returns and linefeeds (wrapping is due to ⎕PW)
// [{"id": "304466804484872","t" : "GOOG","e" : "NASDAQ","l" : "536.10","l_fix" : "536.10","l_cur" : "536.10","s": "0","lt
t":"4:00PM EDT","lt" : "Apr 17, 4:00PM EDT","lt_dts" : "2014-04-17T16:00:00Z","c" : "-20.44","c_fix" : "-20.44","cp"
: "-3.67","cp_fix" : "-3.67","ccol" : "chr","pcls_fix" : "556.54"},{"id": "18241","t" : "IBM","e" : "NYSE","l" : "1
90.01","l_fix" : "190.01","l_cur" : "190.01","s": "0","ltt":"4:02PM EDT","lt" : "Apr 17, 4:02PM EDT","lt_dts" : "201
4-04-17T16:02:08Z","c" : "-6.39","c_fix" : "-6.39","cp" : "-3.25","cp_fix" : "-3.25","ccol" : "chr","pcls_fix" : "19
6.4"}]


The leading // is not valid JSON - you can verify this by pasting the result into an online JSON validator like the one found at http://jsonlint.com/

Then you can use the JSON namespace that's attached below to convert the JSON to a form more usable from APL.
The JSON namespace was developed as a part of the MiServer project, but is a useful standalone utility as well.

      stocks←JSON.JSONtoNS 2↓response  ⍝ drop off the leading // and convert JSON to namespace format
stocks ⍝ each stock symbol is its own namespace
#.JSON.[Namespace] #.JSON.[Namespace]

]disp (⊃stocks).⎕nl -2 ⍝ each namespace contains variables corresponding to the JSON elements
┌→┬─────┬────┬──┬──────┬─┬──┬─┬─────┬─────┬──┬──────┬───┬────────┬─┬─┐
│c│c_fix│ccol│cp│cp_fix│e│id│l│l_cur│l_fix│lt│lt_dts│ltt│pcls_fix│s│t│
└→┴────→┴───→┴─→┴─────→┴→┴─→┴→┴────→┴────→┴─→┴─────→┴──→┴───────→┴→┴→┘

]disp stocks.(t l lt_dts) ⍝ now you've got something you can manipulate from APL
┌→─────────────────────────────────┬─────────────────────────────────┐
│┌→───┬──────┬────────────────────┐│┌→──┬──────┬────────────────────┐│
││GOOG│536.10│2014-04-17T16:00:00Z│││IBM│190.01│2014-04-17T16:02:08Z││
│└───→┴─────→┴───────────────────→┘│└──→┴─────→┴───────────────────→┘│
└─────────────────────────────────→┴────────────────────────────────→┘

↑stocks.(t l lt_dts)
GOOG 536.10 2014-04-17T16:00:00Z
IBM 190.01 2014-04-17T16:02:08Z
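
The 2↓response above relies on the prefix being exactly two characters. A slightly more defensive variant, offered only as a sketch, drops everything before the first [ or { instead. StripPrefix and ToNum are hypothetical helper names, not part of the JSON namespace, and ⎕IO←1 is assumed.

```apl
⍝ Drop everything before the first '[' or '{'
StripPrefix←{(¯1+⌊/⍵⍳'[{')↓⍵}

⍝ Convert a numeric string such as '536.10' to a number (0 if it is not valid)
ToNum←{⊃2⊃⎕VFI ⍵}
```

For example, StripPrefix '// [{"t":"GOOG"}]' leaves the text starting at the opening bracket, and ToNum '536.10' gives a number that can be used in arithmetic, since the values returned for fields like l are character vectors.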
Attachments
JSON.dyalog
JSON/APL conversion utilities
(25.77 KiB)
