Vibe.d based Web crawler/scraper

RASOLOFONIAINA Menjanahary Razafindranto

Posted Sat, 28 Mar 2015 15:59:38 GMT

Are the framework's HTTP client components suitable for writing a web crawler/scraper?

Re: Vibe.d based Web crawler/scraper

Sönke Ludwig

Posted Sat, 28 Mar 2015 19:05:32 +0100 in reply to RASOLOFONIAINA Menjanahary Razafindranto

You could certainly do that. vibe.http.client.requestHTTP uses a
connection pool internally, so existing keep-alive connections to a
server are reused and this kind of access pattern should be
efficient. Running multiple requests in parallel using runTask is
also not a problem. The only thing you might have to write yourself
is throttling logic, to avoid using up too many of the server's
resources.
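
A minimal sketch of such throttling, assuming a vibe.d version that ships LocalTaskSemaphore in vibe.core.sync. The names crawl, spawnFetch, fetchThrottled and the limit of 4 are hypothetical, and crawl is meant to be called from inside a running task. Newer vibe-core releases expect nothrow task callables, so the delegate passed to runTask may need adjusting there.

```d
import vibe.core.core : runTask;
import vibe.core.log : logInfo;
import vibe.core.sync : LocalTaskSemaphore;
import vibe.core.task : Task;
import vibe.http.client : requestHTTP;
import vibe.stream.operations : readAll;

// Hypothetical cap on simultaneous requests; tune it per target server.
enum maxParallel = 4;

// Fetches one URL while holding a semaphore slot.
void fetchThrottled(LocalTaskSemaphore limiter, string url)
{
    limiter.lock();                 // suspends this task until a slot is free
    scope (exit) limiter.unlock();  // release the slot even if the request throws

    requestHTTP(url,
        (scope req) { /* defaults to a plain GET */ },
        (scope res) {
            auto data = res.bodyReader.readAll();
            logInfo("%s -> HTTP %s, %s bytes", url, res.statusCode, data.length);
        });
}

// Separate function so that each task's closure captures its own url.
Task spawnFetch(LocalTaskSemaphore limiter, string url)
{
    return runTask({ fetchThrottled(limiter, url); });
}

// Starts one task per URL and waits for all of them to finish.
void crawl(string[] urls)
{
    auto limiter = new LocalTaskSemaphore(maxParallel);
    Task[] tasks;
    foreach (url; urls)
        tasks ~= spawnFetch(limiter, url);
    foreach (t; tasks)
        t.join();
}
```

All tasks are created up front, but at most maxParallel of them are inside requestHTTP at any time; the rest wait in lock(). A per-host delay between requests would be a separate addition.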

Re: Vibe.d based Web crawler/scraper

RASOLOFONIAINA Menjanahary Razafindranto

Posted Sun, 05 Apr 2015 09:21:05 GMT in reply to Sönke Ludwig

Wow, that's promising!

I have delved into the module documentation.

Vibe.d doesn't provide any facilities for DOM manipulation and node selection. Do you have any suggestions?

Re: Vibe.d based Web crawler/scraper

Sönke Ludwig

Posted Sun, 05 Apr 2015 14:47:45 GMT in reply to RASOLOFONIAINA Menjanahary Razafindranto

The options I know of are the DOM module in the arsd collection and the recently announced htmld package.
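
As an illustration of node selection with the arsd DOM module, here is a minimal sketch. It assumes arsd's dom.d is available to the build as module arsd.dom; extractLinks is a hypothetical helper name, and the html argument would be the text of a fetched response body.

```d
import arsd.dom : Document;

// Returns the href value of every <a> element that has a non-empty one.
string[] extractLinks(string html)
{
    auto document = new Document(html);  // non-strict parsing by default
    string[] links;
    foreach (anchor; document.querySelectorAll("a"))
    {
        auto href = anchor.getAttribute("href");
        if (href.length)
            links ~= href;
    }
    return links;
}
```

The returned values are whatever the page contains, so relative links still need to be resolved against the page URL before they are queued for crawling.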

Re: Vibe.d based Web crawler/scraper

yawniek

Posted Mon, 13 Apr 2015 20:35:17 GMT in reply to Sönke Ludwig

I did this a few weeks ago, to the extent that I brought down the API server, so it's quite fast :)

Roughly, I used this pattern: https://gist.github.com/yannick/98c94cb6530d8aabd420
There might be a better approach, though.

All in all it was a smooth ride, with two main problems as I remember:

I had to use requestHTTP with delegates, because under high concurrency I got weird errors
from bodyReader when using the HTTPClientResponse (probably too late).

A memory leak (which in the end I did not fix).
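
A rough sketch of the two calling styles referred to above, assuming the requestHTTP overloads from vibe.http.client. The function names fetchWithDelegates and fetchWithResponse are hypothetical, and the sketch does not establish what caused the bodyReader errors; it only shows where the body gets read in each style.

```d
import vibe.http.client : requestHTTP;
import vibe.stream.operations : readAll;

// Delegate style: the body is read inside the responder callback,
// while the response is still guaranteed to be valid.
ubyte[] fetchWithDelegates(string url)
{
    ubyte[] data;
    requestHTTP(url,
        (scope req) { /* defaults to a plain GET */ },
        (scope res) { data = res.bodyReader.readAll(); });
    return data;
}

// Response-object style: the caller holds the response and is
// responsible for consuming or dropping the body itself.
ubyte[] fetchWithResponse(string url)
{
    auto res = requestHTTP(url);
    scope (exit) res.dropBody();  // discard anything left unread
    return res.bodyReader.readAll();
}
```

In the delegate style the read cannot happen after the callback has returned, which makes it the more defensive choice when many tasks share the connection pool.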

Re: Vibe.d based Web crawler/scraper

RASOLOFONIAINA Menjanahary Razafindranto

Posted Tue, 12 May 2015 07:36:46 GMT in reply to yawniek

I appreciate your gist. Thanks for the input.