Conversation
Force-pushed from b87c5b7 to a080294
**Codecov Report** — All modified and coverable lines are covered by tests ✅

```
@@           Coverage Diff           @@
##              main      #211      +/- ##
==========================================
  Coverage   100.00%   100.00%
==========================================
  Files           38        39        +1
  Lines         2221      2327      +106
  Branches       426       446       +20
==========================================
+ Hits          2221      2327      +106
```

☔ View full report in Codecov by Sentry.
Force-pushed from 3e08c87 to 0ce636c
**rgaudin** left a comment:

> LGTM but it lacks proper documentation. It's important that the expected behavior is clearly documented so the user can make informed choices.
```python
@property
def exception(self):
    """Exception raises in any thread, if any"""
```

Suggested change:

```python
    """Exception raised in any thread, if any"""
```
```python
self._workers: set[threading.Thread] = set()
self.no_more = False
self._shutdown = False
self.exceptions[:] = []
```

Suggested change:

```python
self.exceptions.clear()
```
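Context for the suggestion: `lst[:] = []` and `lst.clear()` are equivalent in-place clears — both keep the same list object, which matters if other threads hold a reference to `self.exceptions` — but `.clear()` states the intent more plainly. A quick sketch:

```python
# Both slice assignment and clear() empty the list in place, so any
# other reference to the same list object observes the change.
exceptions = [ValueError("boom")]
alias = exceptions              # second reference to the same list object

exceptions[:] = []              # in-place clear via slice assignment
print(alias)                    # → []

exceptions.append(ValueError("again"))
exceptions.clear()              # equivalent, and reads more plainly
print(alias is exceptions)      # → True
print(len(alias))               # → 0
```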
```python
for n in range(self.nb_workers):
```

```python
try:
    func(**kwargs)
except Exception as exc:
    logger.error(f"Error processing {func} with {kwargs=}")
```

Suggested change:

```python
logger.error(f"Error processing function {func.__name__} with {kwargs=}")
```
```python
    # received None from the queue. most likely shutting down
    return
```

```python
raises = kwargs.pop("raises") if "raises" in kwargs.keys() else False
```

**rgaudin:** those things are now part of the `submit` API and should be documented there (in its docstring)
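Side note on the quoted line: `dict.pop` accepts a default, which collapses the `if "raises" in kwargs` dance into one call. A standalone sketch (not the actual `submit` code):

```python
def pop_raises(kwargs: dict) -> bool:
    # Equivalent to: kwargs.pop("raises") if "raises" in kwargs else False
    return kwargs.pop("raises", False)

opts = {"url": "https://example.org", "raises": True}
print(pop_raises(opts))     # → True
print("raises" in opts)     # → False  (pop removed the reserved kwarg)
print(pop_raises({}))       # → False  (default when the key is absent)
```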
```python
logger.error(f"Error processing {func} with {kwargs=}")
logger.exception(exc)
if raises:  # to cover when raises = False
    self.exceptions.append(exc)
```

**rgaudin:** Is this intended behavior? Exceptions are swallowed without a trace (when using `raises=False`)? If so, this must be clearly documented. Not raising but storing the exception is another valid alternative.
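The alternative the reviewer mentions — always storing the exception so nothing disappears without a trace, with `raises` only controlling re-raising — could look roughly like this (hypothetical class and names, not the actual scraperlib code):

```python
import logging

logger = logging.getLogger("executor")

class TaskRunner:
    """Hypothetical sketch: exceptions are always stored, never silently lost."""

    def __init__(self):
        self.exceptions: list[Exception] = []

    def run(self, func, raises: bool = False, **kwargs):
        try:
            func(**kwargs)
        except Exception as exc:
            logger.exception("Error processing %s with kwargs=%r", func, kwargs)
            self.exceptions.append(exc)  # stored regardless of `raises`
            if raises:
                raise  # re-raise only when the caller opted in

runner = TaskRunner()
runner.run(lambda: 1 / 0)        # raises=False: logged and stored, not raised
print(len(runner.exceptions))    # → 1
```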
```python
        except queue.Empty:
            break
```

```python
def join(self):
```

**rgaudin:** What's the recommended way to await an executor completion then?
With this, the `deadline_sec` being mandatory, I can only use `join` when I want to exit, but in a scraper I imagine the scenario being: I collect all my resources and submit them to the executor, then I join, and once everything has been processed (once join completes), I can continue.
This is the regular meaning of join.
If we want it to act as a proper exit method, then the user has to track progress manually, and this should be clearly specified in the documentation of the executor.

```python
while (
    len(alive_threads) > 0 and datetime.datetime.now(tz=datetime.UTC) < deadline
):
    e = threading.Event()
```
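For the "submit everything, then wait until it has all been processed" scenario the reviewer describes, the stdlib building block is `queue.Queue.join()` paired with `task_done()`. A minimal sketch of that general pattern (not the scraperlib implementation):

```python
import queue
import threading

tasks: queue.Queue = queue.Queue()
results = []

def worker():
    while True:
        item = tasks.get()
        if item is None:        # sentinel: ask the worker to exit
            tasks.task_done()
            return
        results.append(item * 2)
        tasks.task_done()       # lets Queue.join() track completion

t = threading.Thread(target=worker, daemon=True)
t.start()

for n in range(5):
    tasks.put(n)
tasks.join()                    # blocks until every submitted task is done
print(sorted(results))          # → [0, 2, 4, 6, 8]

tasks.put(None)                 # now shut the worker down
t.join()
```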
Converting to draft, we are experimenting with joblib in the mindtouch scraper for now.
This PR enriches the scraperlib with a `ScraperExecutor`. This class processes tasks in parallel with a given number of worker threads. It is mainly inspired by the sotoki executor, even though other executors exist in wikihow and iFixit: the wikihow one seems more primitive / ancient, and the iFixit one is just a pale copy.
For easy review, the first commit is simply a copy/paste of the sotoki code, and the next commits are the adaptations / enhancements for scraperlib.
What has been changed compared to the sotoki code:

- added `thread_deadline_sec` to the executor, should we need to customize it per executor (probably the case, priceless and useful for tests at least)
- added `if self.no_more:` in the `submit` method: allows to stop accepting tasks even when the executor is just `join`ed and not `shutdown`
- renamed `prefix` to `executor_name` and moved from `T-` to `executor` (way clearer in the logs, from my experience)
- removed the `release_halt` method, which was misleading / not working (I failed to `join`, then `release_halt`, then `submit` again ... it seems mandatory to `join`, then `start` (again), then `submit`)
- `join` no longer waits `thread_deadline_sec` seconds per thread: that was highly unpredictable when there are many workers (we could wait `thread_deadline_sec` for the first worker, then `thread_deadline_sec` for the second worker, etc.), and it is a bit weird that the last worker in the list has way more time to complete than the first one