How to code an algorithm that creates a RegEx expression if provided with two identical elements on a website?

bernard · Jan 9, 2018

How difficult would it be to code an algorithm that can create its own RegEx expression if provided with two identical elements on a website? It would be used for scraping prices on websites.

Rageix · Jan 9, 2018

bernard said:
How difficult would it be to code an algorithm that can create its own RegEx expression if provided with two identical elements on a website? It would be used for scraping prices on websites.

Normally it's not all that hard, but it depends on how complex the regex is. For scraping prices on websites it's typically not all that difficult, because most stores are laid out in a sane, predictable fashion. Honestly, RegEx might not even be the best choice: there are lots of different ways to scrape a website. XPath can work really well, or you run into things like goquery, which helps a ton with scraping because you can just use CSS selectors like in jQuery.

bernard · Jan 9, 2018

Rageix said:
Normally it's not all that hard, but it depends on how complex the regex is. For scraping prices on websites it's typically not all that difficult, because most stores are laid out in a sane, predictable fashion. Honestly, RegEx might not even be the best choice: there are lots of different ways to scrape a website. XPath can work really well, or you run into things like goquery, which helps a ton with scraping because you can just use CSS selectors like in jQuery.

Hey man, I know how to scrape a little; what I need is something else. There's a plugin I used to have that could figure out regex expressions just from selecting 2 prices from 2 unique products. Does that make sense? It would ask for product url 1 and price 1, then product url 2 and price 2, and then you'd be able to scrape prices without writing any regex yourself. So there had to be some kind of smart detection going on behind the scenes.
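To make the idea concrete, here is a rough sketch of how that kind of detection could work. This is only a guess at the approach, not how the plugin actually does it. It assumes you already have the HTML of both product pages as strings, and the names (`infer_price_regex`, `max_context`, the sample pages) are made up for illustration. The trick is to find each known price in its page and keep only the markup that both pages share immediately before and after it.

```python
import re


def common_suffix(a, b):
    """Longest string that both a and b end with."""
    i = 0
    while i < min(len(a), len(b)) and a[-1 - i] == b[-1 - i]:
        i += 1
    return a[len(a) - i:]


def common_prefix(a, b):
    """Longest string that both a and b start with."""
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    return a[:i]


def infer_price_regex(html1, price1, html2, price2, max_context=40):
    """Build a regex from two (page, known price) examples.

    Hypothetical helper: it uses the FIRST occurrence of each price,
    which is an assumption that can fail if the price also appears
    elsewhere on the page (meta tags, scripts, and so on).
    """
    i1, i2 = html1.find(price1), html2.find(price2)
    if i1 < 0 or i2 < 0:
        raise ValueError("known price not found in page")
    # Markup shared by both pages right before and right after the price.
    left = common_suffix(html1[:i1], html2[:i2])[-max_context:]
    right = common_prefix(html1[i1 + len(price1):],
                          html2[i2 + len(price2):])[:max_context]
    if not left and not right:
        raise ValueError("pages share no context around the price")
    # Capture digits with '.' or ',' separators between the shared context.
    return re.escape(left) + r"([\d.,]+)" + re.escape(right)


# Made-up sample pages standing in for two real product URLs.
page1 = '<div class="product"><h1>Lamp</h1><span class="price">$19.99</span></div>'
page2 = '<div class="product"><h1>Desk chair</h1><span class="price">$249.00</span></div>'
page3 = '<div class="product"><h1>Rug</h1><span class="price">$1,045.50</span></div>'

pattern = infer_price_regex(page1, "19.99", page2, "249.00")
extracted = re.search(pattern, page3).group(1)
print(extracted)
```

A real tool would probably need more than this, for example trying every occurrence of the price or checking the pattern against a third page, but two examples are enough to separate the part of the markup that changes (the price) from the part that stays the same.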

Rageix · Jan 9, 2018

Hmm that's interesting I've never seen anything like that, at least not for regex.

However, something along those lines: Chrome and Firefox both have dev tools. Right click the page and open your dev tools. From the Inspector tab (Firefox) or Elements tab (Chrome) you can right click any element and then copy its XPath. From there it can be pretty simple to just use XPath to scrape. My guess is that is probably what it was, but you never know.
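As a minimal sketch of the XPath route, assuming a small, well-formed page so that Python's standard library can parse it (the markup and paths below are invented for the example):

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed sample page. Real store pages are rarely valid XML,
# so in practice a proper HTML parser would likely be needed instead.
page = """<html><body>
  <div id="product">
    <h1>Lamp</h1>
    <span class="price">$19.99</span>
  </div>
</body></html>"""

root = ET.fromstring(page)  # root is the <html> element

# Dev tools would copy something like /html/body/div/span. ElementTree only
# supports a subset of XPath and wants the path relative to the root element.
price = root.find("./body/div/span").text.strip()

# Matching on the class attribute is usually less brittle than a positional path.
by_class = root.find(".//span[@class='price']").text.strip()

print(price, by_class)
```

The same path should then work on every product page that shares the layout, which is why copying the XPath of one price is often enough.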

bernard · Jan 9, 2018

I don't know, probably wasn't regex then :smile:

It did and does work though. Do you want me to send you a link to the plugin?

Rageix · Jan 10, 2018

Sure, I'd take a look at it.
