path: root/searx/engines/xpath.py
Age | Commit message | Author
2025-02-26[fix] various issues in the documentationMarkus Heiser
Closes: https://github.com/searxng/searxng/issues/4370
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
2025-01-28[refactor] typification of SearXNG / EngineResultsMarkus Heiser
In [1] and [2] we discussed the need of a Result.results property and how we can avoid unclear code. This patch implements a class for the result-lists of engines::

    searx.result_types.EngineResults

A simple example for the usage in engine development::

    from searx.result_types import EngineResults
    ...
    def response(resp) -> EngineResults:
        res = EngineResults()
        ...
        res.add(
            res.types.Answer(answer="lorem ipsum ..", url="https://example.org")
        )
        ...
        return res

[1] https://github.com/searxng/searxng/pull/4183#pullrequestreview-257400034
[2] https://github.com/searxng/searxng/pull/4183#issuecomment-2614301580

Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
2025-01-28[refactor] typification of SearXNG (initial) / result items (part 1)Markus Heiser
Typification of SearXNG
=======================

This patch introduces the typing of the results. The why and how is described in the documentation, please generate the documentation ..

    $ make docs.clean docs.live

and read the following articles in the "Developer documentation":

- result types --> http://0.0.0.0:8000/dev/result_types/index.html

The result types are available from the `searx.result_types` module. The following have been implemented so far:

- base result type: `searx.result_types.Result`
  --> http://0.0.0.0:8000/dev/result_types/base_result.html
- answer results
  --> http://0.0.0.0:8000/dev/result_types/answer.html

including the type for translations (inspired by #3925). For all other types (which still need to be set up in subsequent PRs), template documentation has been created for the transition period.

Doc of the fields used in Templates
===================================

The template documentation is the basis for the typing and is the first complete documentation of the results (needed for engine development). It is the "working paper" (the plan) with which further typifications can be implemented in subsequent PRs.

- https://github.com/searxng/searxng/issues/357

Answer Templates
================

With the new (sub) types for `Answer`, the templates for the answers have also been revised; `Translation` results are now displayed with collapsible entries (inspired by #3925).

    !en-de dog

Plugins & Answerer
==================

The implementation for `Plugin` and `Answer` has been revised, see documentation:

- Plugin: http://0.0.0.0:8000/dev/plugins/index.html
- Answerer: http://0.0.0.0:8000/dev/answerers/index.html

With `PluginStorage` and `AnswerStorage` to manage those items (in follow up PRs, `ArticleStorage`, `InfoStorage` and .. will be implemented)

Autocomplete
============

The autocompletion had a bug where the results from `Answer` had not been shown in the past. To test, activate autocompletion and try search terms for which we have answerers:

- statistics: type `min 1 2 3` .. in the completion list you should find an entry like `[de] min(1, 2, 3) = 1`
- random: type `random uuid` .. in the completion list, the first item is a random UUID

Extended Types
==============

SearXNG extends e.g. the request and response types of flask and httpx; a module has been set up for type extensions:

- Extended Types --> http://0.0.0.0:8000/dev/extended_types.html

Unit-Tests
==========

The unit tests have been completely revised. In the previous implementation, the runtime (the global variables such as `searx.settings`) was not initialized before each test, so the runtime environment with which a test ran was always determined by the tests that ran before it. This was also the reason why we sometimes had to observe non-deterministic errors in the tests in the past:

- https://github.com/searxng/searxng/issues/2988 is one example for the Runtime issues, with non-deterministic behavior ..
- https://github.com/searxng/searxng/pull/3650
- https://github.com/searxng/searxng/pull/3654
- https://github.com/searxng/searxng/pull/3642#issuecomment-2226884469
- https://github.com/searxng/searxng/pull/3746#issuecomment-2300965005

Why msgspec.Struct
==================

We have already discussed typing based on e.g. `TypedDict` or `dataclass` in the past:

- https://github.com/searxng/searxng/pull/1562/files
- https://gist.github.com/dalf/972eb05e7a9bee161487132a7de244d2
- https://github.com/searxng/searxng/pull/1412/files
- https://github.com/searxng/searxng/pull/1356

In my opinion, TypedDict is unsuitable because the objects are still dictionaries and not instances of classes / the `dataclass` are classes but ...

The `msgspec.Struct` combines the advantages of typing and runtime behaviour, and also offers the option of (fast) serializing (incl. type check) the objects.

Currently not possible but conceivable with `msgspec`: outsourcing the engines into separate processes. What possibilities this opens up in the future is left to the imagination! Internally, we have already defined that it is desirable to decouple the development of the engines from the development of the SearXNG core / the serialization of the `Result` objects is a prerequisite for this.

HINT: The threads listed above were the template for this PR, even though the implementation here is based on msgspec. They should also be an inspiration for the following PRs of typification, as the models and implementations can provide a good direction.

Why just one commit?
====================

I tried to create several (thematically separated) commits, but gave up at some point ... there are too many things to tackle at once / the comprehensibility of the commits would not be improved by a thematic separation. On the contrary, we would have to make multiple changes at the same places and the goal of a change would be only vaguely recognizable in the fog of the commits.

Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
2024-11-29[mod] hardening xpath engine: ignore empty resultsMarkus Heiser
A SearXNG maintainer on Matrix reported a traceback::

    File "searxng-src/searx/engines/xpath.py", line 272, in response
      dom = html.fromstring(resp.text)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "searx-pyenv/lib/python3.11/site-packages/lxml/html/__init__.py", line 850, in fromstring
      doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "searx-pyenv/lib/python3.11/site-packages/lxml/html/__init__.py", line 738, in document_fromstring
      raise etree.ParserError(
    lxml.etree.ParserError: Document is empty

I don't have an example to reproduce the issue, but the issue and this patch are clearly recognizable even without an example.

Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
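The hardening described in this commit can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual patch: `parse_results` and `strict_parse` are hypothetical names, and `strict_parse` only stands in for an lxml-based parser that raises on an empty document.

```python
from typing import Callable, Dict, List

Result = Dict[str, str]


def strict_parse(text: str) -> List[Result]:
    """Hypothetical stand-in for an lxml-based parser: it fails on empty input."""
    if not text.strip():
        raise ValueError("Document is empty")
    # Treat every non-empty line as one result title (illustration only).
    return [{"title": line.strip()} for line in text.splitlines() if line.strip()]


def parse_results(
    text: str, parser: Callable[[str], List[Result]] = strict_parse
) -> List[Result]:
    """Return an empty result list instead of handing an empty body to the parser."""
    if not text or not text.strip():
        return []
    return parser(text)
```

The guard runs before the parser is called, so an empty response body yields no results rather than an exception.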
2024-11-28[feat] json/xpath engine: config option for method and bodyBnyro
2024-05-16[mod] simple theme: drop img_src from default resultsMarkus Heiser
The use of img_src AND thumbnail in the default results makes no sense (only a thumbnail is needed). In the current state this is rather confusing, because img_src is displayed like a thumbnail (small) and thumbnail is displayed like an image (large). Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
2024-03-11[mod] pylint all engines without PYLINT_SEARXNG_DISABLE_OPTIONMarkus Heiser
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
2023-09-18[fix] spellingjazzzooo
2023-07-01[doc] rearranges Settings & Engines docs for better readabilityMarkus Heiser
We have built up detailed documentation of the *settings* and the *engines* over the past few years. However, this documentation was still spread over various chapters and was difficult to navigate in its entirety. This patch rearranges the Settings & Engines documentation for better readability.

To review new ordered docs::

    make docs.clean docs.live

Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
2022-09-27[fix] typos / reported by @kianmeng in searx PR-3366Markus Heiser
[PR-3366] https://github.com/searx/searx/pull/3366
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
2022-09-04xpath engine: change raise_for_httperror to no_result_for_http_statusAlexandre FLAMENT
no_result_for_http_status contains a list of HTTP status codes. A response with one of these status codes is treated as an empty result list. In all other cases an exception is thrown as usual. Previously, raise_for_httperror ignored all HTTP errors, which made defective engines invisible in the stats.
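The behaviour described above could look roughly like this. It is a sketch under assumptions: `check_status` and the `HTTPStatusError` class are hypothetical stand-ins, not the engine's or the HTTP client's real API.

```python
from typing import Iterable


class HTTPStatusError(Exception):
    """Hypothetical stand-in for the HTTP client's status error."""


def check_status(status_code: int, no_result_for_http_status: Iterable[int]) -> bool:
    """Return True if the response should be treated as an empty result list.

    Any other error status (>= 400) still raises, so defective engines
    remain visible in the statistics.
    """
    if status_code in no_result_for_http_status:
        return True
    if status_code >= 400:
        raise HTTPStatusError(f"HTTP status {status_code}")
    return False
```

With `no_result_for_http_status = [404]`, a 404 yields an empty result list while a 500 still raises.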
2022-09-04[fix] engine woxikon.de - don't raise exception on empty result listMarkus Heiser
Woxikon expects a word in German, so with query "foo" the site finds nothing and responds with a 404:

    httpx.HTTPStatusError: Client error '404 Not Found' \
      for url 'https://synonyme.woxikon.de/synonyme/foo.php'

[1] https://github.com/searxng/searxng/issues/1543#issuecomment-1193317054

Closes: https://github.com/searxng/searxng/issues/1543
Suggested-by: @allendema [1]
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
2022-04-17[fix] Update only cookies/headersAllen
2022-04-17[lint] Remove whitespaceAllen
From GH GUI
2022-04-16[enh] Allow passing headers/cookies from settings.ymlAllen
Example:

    - engine: xpath
    - search_url: example.org
    - headers: {'example_header': 'example_header'}
    - cookies: {'safesearch': 'off'}
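How such settings might reach the outgoing request can be sketched as follows. This is an assumption-laden illustration: the module-level `headers` and `cookies` take the values from the example above, and the `request` function here is a simplified stand-in for an engine's request hook.

```python
from typing import Any, Dict

# Engine-level settings, as they might be loaded from settings.yml
# (values taken from the example above).
headers: Dict[str, str] = {"example_header": "example_header"}
cookies: Dict[str, str] = {"safesearch": "off"}


def request(query: str, params: Dict[str, Any]) -> Dict[str, Any]:
    """Merge the configured headers/cookies into the request parameters."""
    # update() keeps entries that are already present (e.g. a User-Agent)
    # instead of replacing the whole dictionary.
    params.setdefault("headers", {}).update(headers)
    params.setdefault("cookies", {}).update(cookies)
    return params
```

Merging with `update()` rather than assigning a new dict leaves any headers or cookies set elsewhere untouched.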
2021-12-27[format.python] initial formatting of the python codeMarkus Heiser
This patch was generated by black [1]::

    make format.python

[1] https://github.com/psf/black

Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
2021-09-11[mod] xpath engine: remove logging of the requested URLAlexandre Flament
2021-09-07[pylint] engines: drop no longer needed 'missing-function-docstring'Markus Heiser
Suggested-by: @dalf https://github.com/searxng/searxng/issues/102#issuecomment-914168470
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
2021-09-07[fix] add 'categories' to PYLINT_ADDITIONAL_BUILTINS_FOR_ENGINESMarkus Heiser
and drop the no longer needed (see line 591 in 7b235a1)::

    # pylint: disable=undefined-variable

Suggested-by: @dalf https://github.com/searxng/searxng/issues/102#issuecomment-914068609
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
2021-09-06[mod] one logger per engine - drop obsolete logger.getChildMarkus Heiser
Remove the no longer needed `logger = logger.getChild(...)` from engines. Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
2021-09-04[fix] remove minimum length of content for XPath engineMarkus Heiser
Instead of raising an exception and thereby hiding all results of the engine, it makes sense to remove that requirement in order to allow the implementation of search engines that do not always have a description. In fact, some search engines that have a description in 99% of cases, like Brave Search or Mojeek, crash completely if for some reason they include a result with no description.

To test this patch, try Mojeek:

    !mjk xyz

before and after the patch.

Suggested-by: 0xhtml in https://github.com/searx/searx/discussions/2933
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
2021-05-23[enh] XPath engine - add time safe-search supportMarkus Heiser
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
2021-05-23[enh] XPath engine - add time range supportMarkus Heiser
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
2021-05-23[enh] XPath engine - add ISO 639-1 {lang} replacement to search-URLMarkus Heiser
BTW: remove the obsolete params['query'] and the unneeded paging condition.

Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
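A `{lang}` replacement of this kind might be sketched as below. The function name `build_search_url`, the `lang_all` fallback parameter, and the example URL are assumptions made for illustration; only the idea of reducing the language to a two-letter ISO 639-1 code comes from the commit title.

```python
from urllib.parse import quote_plus


def build_search_url(
    search_url: str, query: str, language: str, lang_all: str = "en"
) -> str:
    """Fill the {query} and {lang} placeholders of a search URL template."""
    # 'all' means no specific language was selected: fall back to lang_all.
    # Otherwise keep only the two-letter ISO 639-1 part, e.g. 'de-DE' -> 'de'.
    lang = lang_all if language == "all" else language[:2]
    return search_url.format(query=quote_plus(query), lang=lang)
```

For instance, `build_search_url("https://example.org/search?q={query}&hl={lang}", "foo bar", "de-DE")` fills in `foo+bar` and `de`.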
2021-05-23[doc] add documentation about the XPath engineMarkus Heiser
- pylint searx/engines/xpath.py
- fix indentation of some long lines
- add logging
- add doc-strings

Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
2021-05-17[enh] xpath engine - add request parameter 'soft_max_redirects'Markus Heiser
Make 'soft_max_redirects' configurable per XPath engine::

    - name : <engine-name>
      engine : xpath
      soft_max_redirects: 1
      ...

Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
2021-01-14[enh] engines: add about variableAlexandre Flament
Move the meta information from the comment to the about variable so that the preferences and the documentation can show this information.
2020-12-10[fix] xpath, mojeek: fix commit 58d72f26925d56e22330c54be03c3dcbee0c4135Alexandre Flament
Before commit 58d72f2, the category was not set in xpath.py, so searx/engines/__init__.py set the category to ['general']. Commit 58d72f2 set the category to [], which is not replaced by searx/engines/__init__.py. Consequence: the mojeek engine is hidden in the preferences. This commit reverts the xpath.py change. Close #2368
2020-12-03[mod] xpath, 1337x, acgsou, apkmirror, archlinux, arxiv: use eval_xpath_* functionsAlexandre Flament
2020-11-03[mod] pylint: minor code change to allow pylint globallyAlexandre Flament
This commit is only a step, it doesn't fix all the issues reported by pylint
2020-10-25[enh] Add onions category with Ahmia, Not Evil and Torcha01200356
Xpath engine and results template changed to account for the fact that archive.org doesn't cache .onions, though some onion engines might have their own cache. Disabled by default. Can be enabled by setting the SOCKS proxies to wherever Tor is listening and setting using_tor_proxy as True. Requires Tor and updating packages. To avoid manually adding the timeout on each engine, you can set extra_proxy_timeout to account for Tor's (or whatever proxy used) extra time.
2020-10-02[mod] move extract_text, extract_url to searx.utilsAlexandre Flament
2020-09-10Drop Python 2 (1/n): remove unicode string and url_utilsDalf
2020-07-23Fix relative urls that do not start with '/'xywei
2019-11-15[mod] speed optimizationDalf
- compile XPath only once
- avoid redundant call to urlparse
- get_locale(webapp.py): avoid useless call to request.accept_languages.best_match
2019-07-25[fix] fixes google play engines (#1651)Alexandre Flament
update commit 87baa74a863ac74ae4c86bbfcb04148ba7f70696
2019-07-25[fix] fixes google play engines and adds thumbnails to their results (#1612)Venca24
fix google play apps, google play apps, google play music engines
xpath engine: thumbnail_xpath can define an optional thumbnail
2018-04-08[fix] append http if no scheme is provided in xpath's extract_urlMarc Abonce Seguin
This solves a bug with Yahoo where some results don't specify a protocol.
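A fix of this kind might look like the following sketch. `normalize_url` is a hypothetical helper, not the engine's actual `extract_url`; it only illustrates adding a scheme when a result URL lacks one.

```python
from urllib.parse import urljoin, urlparse


def normalize_url(url: str, base_url: str) -> str:
    """Return an absolute URL with a scheme, using base_url where needed."""
    if url.startswith("//"):
        # Protocol-relative URL: borrow the scheme of the search URL.
        return f"{urlparse(base_url).scheme or 'http'}:{url}"
    if url.startswith("/"):
        # Path relative to the search host.
        return urljoin(base_url, url)
    if not urlparse(url).scheme:
        # No scheme at all, e.g. 'example.org/page': assume http.
        return "http://" + url
    return url
```

With `base_url="https://search.example/s?q=1"`, the input `example.org/a` becomes `http://example.org/a`, while `//cdn.example.org/x` inherits `https`.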
2017-05-22[fix] produce valid urls if scheme is missingAdam Tauber
2017-05-15[enh] py3 compatibilityAdam Tauber
2017-01-17[fix] allow empty contentDavid A Roberts
2016-12-31[fix] extract_text: use html.tostring instead html_to_text. Fix #711Alexandre Flament
2016-08-14[fix] behaviour for page_size>1 and first_page_num>0David A Roberts
eg. pageno=1,21,41,... instead of 20,40,60,...
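The sequences in this commit message can be reproduced with a small calculation. The two functions below are an illustrative sketch: `page_param` gives the corrected sequence, and `page_param_old` is merely one formula that would yield the old sequence, assumed here for contrast.

```python
def page_param(pageno: int, page_size: int = 1, first_page_num: int = 1) -> int:
    """Offset sent to the remote site for the 1-based result page `pageno`."""
    return (pageno - 1) * page_size + first_page_num


def page_param_old(pageno: int, page_size: int = 1, first_page_num: int = 1) -> int:
    """Assumed earlier formula that multiplies after adding the first page number."""
    return (pageno + first_page_num - 1) * page_size
```

With `page_size=20` and `first_page_num=1`, pages 1, 2 and 3 map to 1, 21 and 41 with `page_param`, and to 20, 40 and 60 with `page_param_old`.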
2016-03-28Add paging support to XPath & Erowid enginesKirill Isakov
2016-01-18[fix] pep8 compatibilityAdam Tauber
2015-01-25Sanitize extract_textCqoicebordel
2014-03-04[fix] error when xpath_results in extract_text is _ElementUnicodeResult instead of _ElementStringResultpotato
2014-02-11[mod] len() removed from conditionsasciimoo
2014-01-30[fix] function parametersasciimoo
2014-01-30[fix] function parametersasciimoo