I am sure this is child’s play for some folks here @ DMS, but I’m not one of them.
Let’s say you have a web site you have to use (no API, no ‘command prompt’, nothing but the browser view), which sits behind a login and blah blah blah, long story short: it eventually feeds back a page with a table containing various elements you would like to store in a simple CSV. Normally this would be simple enough to select with the mouse, copy, and paste into MS Excel or the like, but this particular page has a bunch of HTML-type nonsense encapsulating some, but not all, of the fields’ data.
I would like to know how someone (I) could capture that page’s data, strip the HTML nonsense, and save it to CSV or equivalent. I have several working theories, but since this is all brand new to me, I figured I could ask rather than reinventing what I’m sure is barely a blip on some of your skill sets…
What I know so far that might be useful: the data itself appears to be contained in the “code” if “page source” is viewed.
Each row (record) looks something like this:
What I need out of it is the Name and “123456789” number from the end of line 12, and the email address contained repeatedly in line 16. As far as I can tell, this jumbled mess appears for each record I desire, along with a metric poop-tonne of garbage that web pages all seem to contain…
I appreciate any pointers.
In the past I’ve done something similar in another language. What I think is most important are two commands. The first retrieves the website: ideally a function where you give it a URL and it gives back a string (of nonsense). Some languages even have specifiers that only return the nonsense you want. The second function is what I would call the ‘regular expression.’ Most languages have a regular-expression command that can pretty much parse the living daylights out of any text.
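Those two pieces can be sketched in a few lines of Python with only the standard library. Everything here is an assumption, not the real site: the URL is a placeholder, the cookie handling is hypothetical, and since the real page sits behind a login, the regex half runs against a canned sample row instead of live output:

```python
import re
import urllib.request

def fetch(url, cookie=None):
    """Retrieve a URL and return its body as one big string of 'nonsense'."""
    req = urllib.request.Request(url)
    if cookie:
        # Session cookie from the site's login, if one is needed.
        req.add_header("Cookie", cookie)
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8", errors="replace")

# Regex demo on a made-up row, since the real page needs a login.
sample = ('<td class="x">Jane Doe</td><td>123456789</td>'
          '<td><a href="mailto:jane@example.com">jane@example.com</a></td>')

# Grab the text before a </td>, then the nine-digit number in the next cell.
name, number = re.search(r">([^<]+)</td><td>(\d{9})<", sample).groups()
# Grab whatever follows "mailto:" up to the closing quote.
email = re.search(r'mailto:([^"]+)', sample).group(1)
print(name, number, email)
```

The patterns above are guesses at the row layout; the real page source would dictate the actual expressions.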
If one were to use Octave (it’s on the jump server) …
Occasional ad hoc. Quarterly, or so.
Tranquility Reader strips JavaScript from the page, which (apparently) eats the first field, containing the Name and “123456789” number I need. As far as I can tell, the “javascripting” is one of the “html-like nonsense”…um…“features?” that makes copying and pasting manually M U C H more work than it should be (and hence why I ask this question). I suppose it adds some value I have yet to understand… Tranquility Reader DOES make the page more visually appealing, though!
@darrent Thank you for this. I fear I am entirely too dense to make this work for me, but I’m checking it out…
I think this is probably the way to go…I did not review DT’s links, and I’d use Python, myself, but I think the process, in either case, is pretty straightforward.
Python approach typically would use Beautiful Soup (don’t ask me where the name came from…):
Here’s a basic example which does not do exactly what is wanted but does show how few lines of actual code would be needed:
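Something along these lines, perhaps — assuming bs4 is installed, and using an invented table row as a stand-in for the real page source (the actual tags and classes will differ):

```python
from bs4 import BeautifulSoup

# Canned stand-in for the saved page source; the markup and the embedded
# script are invented to mimic the "html nonsense" described above.
html = """
<table>
  <tr>
    <td><script>someJunk();</script>Jane Doe 123456789</td>
    <td><a href="mailto:jane@example.com">jane@example.com</a></td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Throw away the scripting that makes manual copy/paste so painful.
for script in soup.find_all("script"):
    script.decompose()

# One list of cell texts per table row.
rows = []
for tr in soup.find_all("tr"):
    rows.append([td.get_text(strip=True) for td in tr.find_all("td")])

print(rows)
```

From there, each inner list is one CSV record waiting to happen.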
The -s suppresses the progress bars etc. from the output, and you would put in the cookie value you receive when authenticating with the website login so the tool can pull the data.
This will pull out the full line containing “identifier.” That is the initial pull, before we filter down to the parts of each line we want.
We can then run it through a RegEx to get something more specific. Not gonna lie, my RegEx is terrible. This is probably about as short a reference as it gets: grep regular expression syntax (GNU Findutils 4.10.0)
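Put together, the pipeline might look like the sketch below. The URL and cookie are placeholders, and since the real page needs a login, the grep stage is demonstrated on a canned line instead of live curl output:

```shell
# The real invocation would be something like (URL and cookie made up):
#   curl -s -H 'Cookie: session=YOURCOOKIE' 'https://example.com/report' | grep identifier

# Canned stand-in for one matching line from such a page:
line='<td id="identifier">Jane Doe 123456789</td><td>jane@example.com</td>'

# -o prints only the matching part of the line; -E enables extended regex.
printf '%s\n' "$line" | grep -oE '[0-9]{9}'
printf '%s\n' "$line" | grep -oE '[[:alnum:]._%+-]+@[[:alnum:].-]+'
```

The two grep calls pull out the nine-digit number and the email address respectively; the character classes are rough guesses and would need tuning against the real data.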
I would like to point out that HTML and XML share the same root (SGML), and therefore XPath / XQuery tend to work on HTML pages as well (especially the XHTML types).
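For markup that is well-formed enough to count as XHTML, even Python’s standard-library ElementTree can run the limited XPath it supports. A small sketch on a made-up row (real-world HTML usually needs lxml or a tidy pass first):

```python
import xml.etree.ElementTree as ET

# An invented, well-formed row; real pages are rarely this clean.
snippet = """
<table>
  <tr>
    <td>Jane Doe</td>
    <td>123456789</td>
    <td><a href="mailto:jane@example.com">jane@example.com</a></td>
  </tr>
</table>
"""

root = ET.fromstring(snippet)
# ".//td" is ElementTree's XPath subset: all <td> descendants.
cells = [td.text for td in root.findall(".//td")[:2]]
email = root.findtext(".//a")
print(cells, email)
```

If the page is not valid XML, `fromstring` will raise a parse error, which is exactly where Beautiful Soup’s forgiving parser earns its keep.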
I’d like to thank each of you for chiming in very helpfully!
Although I didn’t understand 90% of what was said, I think the 10% I did understand has my compass oriented properly. I re-discovered something I already knew (but had forgotten I knew), which is that this is a two-part problem:
1. I need to discover the pattern in the output so I can devise a method of extracting the parts I need. It’s a table, buddy. There’s a pattern. Break it down. Look at the elements. Figure out how the browser knows what each element is and where to put it on a line. There ya go… (I had allowed this bit to overwhelm me, but I’ve got it now, and I’ve got it documented; now I just need to test and implement.)
2. Getting the data. Despite your assurances that this isn’t that hard, I am failing to see a method by which I can accomplish it easily. It’s a forms-based thingy that doesn’t use URL manipulation (except for the initial login, which, thanks to Jim reminding me about curl, I can confirm DOES work, but does NOT give me my data, because none of the forms are filled in, I guess; not a web guy, so I might not be using the right terms, apologies). Anyway, for how infrequently I’ll need this, I realized I’d actually be doing MORE work than if I just pluck the fruit that’s hanging before my eyes:
So, for now, all I need to do is browse to the page, put in the required selections, and grab the page source text. Then I can use whatever I get out of Step 1 to do the hard work for me. Maybe some day I’ll understand how the other methods posted can work for this particular scenario (I have no doubt I am the limitation, but I’m going to blame the technology anyway in the meantime!) and make this a “button push”, but for now, this is all I was hoping for.
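That manual workflow — save the page source by hand, then post-process it — fits in a few lines of Python. Everything below is hypothetical: the file name, the row layout, the patterns, and the column names are all stand-ins, and the demo uses a canned two-record string instead of a real saved file:

```python
import csv
import io
import re

# In practice you'd read the saved source, e.g.:
#   page = open("page_source.html", encoding="utf-8").read()
# Canned stand-in with two invented records:
page = (
    '<td>Jane Doe 123456789</td><td><a href="mailto:jane@example.com">jane@example.com</a></td>\n'
    '<td>John Roe 987654321</td><td><a href="mailto:john@example.com">john@example.com</a></td>\n'
)

# Per record: a name, a space, a nine-digit number, then the email address.
record = re.compile(r'<td>([^<]+?) (\d{9})</td>.*?mailto:([^"]+)')

out = io.StringIO()  # swap for open("report.csv", "w", newline="") to write a file
writer = csv.writer(out)
writer.writerow(["name", "number", "email"])
for name, number, email in record.findall(page):
    writer.writerow([name, number, email])

print(out.getvalue())
```

Browse, fill in the form, grab the source, run the script: not quite a button push, but close.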
I knew this would be child’s play for this crowd!
Thank you to each of you again!