damklis / dataengineeringproject

Example end to end data engineering project.

License: MIT License

Dockerfile 4.40% Python 87.92% Shell 7.68%
big-data scraping mongodb elasticsearch data-engineering kafka kafka-connect debezium django-rest-framework redis

dataengineeringproject's People

Contributors

damklis, dependabot[bot], szczeles

dataengineeringproject's Issues

Token Key Attribute

Hi,
I have successfully generated a token for myself. The next step is to include the token in the request header. What is the header key for the token?
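
A hedged sketch, not confirmed against this repository: the project is tagged django-rest-framework, and if the API uses DRF's built-in TokenAuthentication (an assumption), the token goes in the `Authorization` header, prefixed with the keyword `Token` and a space. The helper names `build_auth_headers` and `build_request` are hypothetical, and the URL in the usage comment is only an example endpoint.

```python
import urllib.request


def build_auth_headers(token: str) -> dict:
    """Build the header expected by DRF's TokenAuthentication.

    Assumption: the API uses the default "Token" keyword. A JWT-based
    setup would typically expect a different prefix instead.
    """
    return {"Authorization": f"Token {token}"}


def build_request(url: str, token: str) -> urllib.request.Request:
    """Create (but do not send) a GET request carrying the token header."""
    return urllib.request.Request(url, headers=build_auth_headers(token))


# Usage (sends nothing until passed to urllib.request.urlopen):
# request = build_request("http://0.0.0.0:5000/api/v1/news/", "<your token>")
```

If the server still answers 401 with this header, the project may be using a different authentication class, in which case its settings are the place to check.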

BAD REQUEST (400)

I am getting BAD REQUEST (400) when I try to connect to any URL, e.g. http://0.0.0.0:5000/api/v1/news/. What are the steps to resolve this?
Output of ./manage.sh up

Creating infrastructure...
Recreating mongo ... done
MongoDB shell version v4.2.17
connecting to: mongodb://localhost:27017/rss_news?compressors=disabled&gssapiServiceName=mongodb
Implicit session: session { "id" : UUID("59975a30-4c37-429f-a574-f5178dc7f588") }
MongoDB server version: 4.2.17
{
	"ok" : 0,
	"errmsg" : "command replSetInitiate requires authentication",
	"code" : 13,
	"codeName" : "Unauthorized"
}
bye
Initiated replica set
MongoDB shell version v4.2.17
connecting to: mongodb://localhost:27017/admin?compressors=disabled&gssapiServiceName=mongodb
Implicit session: session { "id" : UUID("22b505ad-c940-46bf-ba2b-2620b4bacd36") }
MongoDB server version: 4.2.17
2021-10-11T04:28:05.628+0000 E  QUERY    [js] uncaught exception: Error: couldn't add user: command createUser requires authentication :
_getErrorWithCode@src/mongo/shell/utils.js:25:13
DB.prototype.createUser@src/mongo/shell/db.js:1413:11
@(shell):1:1
2021-10-11T04:28:05.628+0000 E  QUERY    [js] uncaught exception: Error: command grantRolesToUser requires authentication :
_getErrorWithCode@src/mongo/shell/utils.js:25:13
DB.prototype.grantRolesToUser@src/mongo/shell/db.js:1635:15
@(shell):1:1
bye
MongoDB shell version v4.2.17
connecting to: mongodb://localhost:27017/admin?compressors=disabled&gssapiServiceName=mongodb
Implicit session: session { "id" : UUID("0e95023f-9a00-4201-8a07-7f13608ed26b") }
MongoDB server version: 4.2.17
{
	"ok" : 0,
	"errmsg" : "not master",
	"code" : 10107,
	"codeName" : "NotWritablePrimary"
}
2021-10-11T04:28:05.745+0000 E  QUERY    [js] uncaught exception: Error: couldn't add user: not master :
_getErrorWithCode@src/mongo/shell/utils.js:25:13
DB.prototype.createUser@src/mongo/shell/db.js:1413:11
@(shell):1:1
bye
Recreating postgres ... 
Recreating postgres      ... done
Recreating elasticsearch ... done
Recreating zookeeper     ... done
Creating minio           ... done
Recreating redis           ... done
Recreating airflow       ... done
Recreating api           ... done
Recreating proxy           ... done
Recreating kafka         ... done
Recreating schema-registry ... done
Recreating connect         ... done

OUTPUT of docker ps command:

CONTAINER ID        IMAGE                                   COMMAND                  CREATED             STATUS                             PORTS                                        NAMES
0bc8ef3be938        dataengineeringproject_connect          "./register_connecto…"   40 seconds ago      Up 39 seconds (health: starting)   0.0.0.0:8083->8083/tcp, 9092/tcp             connect
11abb03946fb        confluentinc/cp-schema-registry:5.3.1   "/etc/confluent/dock…"   41 seconds ago      Up 40 seconds                      8081/tcp                                     schema-registry
96ffc1d68f8b        dataengineeringproject_kafka            "./create_default_to…"   41 seconds ago      Up 40 seconds                      9092/tcp                                     kafka
cf36bccafb4b        dataengineeringproject_proxy            "/docker-entrypoint.…"   42 seconds ago      Up 40 seconds                      0.0.0.0:5000->5000/tcp, 8080/tcp             proxy
49266edc8a5f        dataengineeringproject_api              "./run_api.sh"           42 seconds ago      Up 42 seconds                                                                   api
3f776a6022ed        dataengineeringproject_airflow          "/entrypoint.sh webs…"   43 seconds ago      Up 41 seconds (healthy)            5555/tcp, 8793/tcp, 0.0.0.0:8080->8080/tcp   airflow
2bfc84caa93f        redis:alpine                            "docker-entrypoint.s…"   43 seconds ago      Up 40 seconds                      0.0.0.0:6379->6379/tcp                       redis
ada4f43e8a2e        confluentinc/cp-zookeeper:5.3.1         "/etc/confluent/dock…"   43 seconds ago      Up 41 seconds                      2888/tcp, 0.0.0.0:2181->2181/tcp, 3888/tcp   zookeeper
d9e5f3169391        dataengineeringproject_elasticsearch    "/tini -- /usr/local…"   43 seconds ago      Up 42 seconds                      0.0.0.0:9200->9200/tcp, 9300/tcp             elasticsearch
d9ce588ea654        postgres:9.6                            "docker-entrypoint.s…"   43 seconds ago      Up 42 seconds                      5432/tcp                                     postgres
bbb7edd6cf8d        mongo:4.2                               "docker-entrypoint.s…"   55 seconds ago      Up 54 seconds                      0.0.0.0:27017->27017/tcp                     mongo

Improve proxy healthcheck

Currently, if no valid proxy is found within 5 tries on any exporter, the DAG fails. The usual cause of errors is a proxy health issue (even after 5 retries), as in the logs below:

*** Reading local file: /usr/local/airflow/logs/rss_news_dag/exporting_101greatgoals_news_to_broker/2020-10-11T08:10:00+00:00/1.log
[2020-10-11 08:21:15,995] {{taskinstance.py:655}} INFO - Dependencies all met for <TaskInstance: rss_news_dag.exporting_101greatgoals_news_to_broker 2020-10-11T08:10:00+00:00 [queued]>
[2020-10-11 08:21:16,029] {{taskinstance.py:655}} INFO - Dependencies all met for <TaskInstance: rss_news_dag.exporting_101greatgoals_news_to_broker 2020-10-11T08:10:00+00:00 [queued]>
[2020-10-11 08:21:16,029] {{taskinstance.py:866}} INFO - 
--------------------------------------------------------------------------------
[2020-10-11 08:21:16,029] {{taskinstance.py:867}} INFO - Starting attempt 1 of 1
[2020-10-11 08:21:16,029] {{taskinstance.py:868}} INFO - 
--------------------------------------------------------------------------------
[2020-10-11 08:21:16,053] {{taskinstance.py:887}} INFO - Executing <Task(PythonOperator): exporting_101greatgoals_news_to_broker> on 2020-10-11T08:10:00+00:00
[2020-10-11 08:21:16,055] {{standard_task_runner.py:53}} INFO - Started process 2164 to run task
[2020-10-11 08:21:16,162] {{logging_mixin.py:112}} INFO - Running %s on host %s <TaskInstance: rss_news_dag.exporting_101greatgoals_news_to_broker 2020-10-11T08:10:00+00:00 [running]> cfc5513180c6
[2020-10-11 08:21:16,195] {{logging_mixin.py:112}} INFO - [2020-10-11 08:21:16,195] {{retry_on_exception.py:14}} INFO - Retries: 5
[2020-10-11 08:21:16,201] {{logging_mixin.py:112}} INFO - [2020-10-11 08:21:16,200] {{conn.py:378}} INFO - <BrokerConnection node_id=bootstrap-0 host=kafka:9092 <connecting> [IPv4 ('172.19.0.10', 9092)]>: connecting to kafka:9092 [('172.19.0.10', 9092) IPv4]
[2020-10-11 08:21:16,201] {{logging_mixin.py:112}} INFO - [2020-10-11 08:21:16,201] {{conn.py:1195}} INFO - Probing node bootstrap-0 broker version
[2020-10-11 08:21:16,202] {{logging_mixin.py:112}} INFO - [2020-10-11 08:21:16,202] {{conn.py:407}} INFO - <BrokerConnection node_id=bootstrap-0 host=kafka:9092 <connecting> [IPv4 ('172.19.0.10', 9092)]>: Connection complete.
[2020-10-11 08:21:16,307] {{logging_mixin.py:112}} INFO - [2020-10-11 08:21:16,307] {{conn.py:1257}} INFO - Broker version identified as 1.0.0
[2020-10-11 08:21:16,307] {{logging_mixin.py:112}} INFO - [2020-10-11 08:21:16,307] {{conn.py:1259}} INFO - Set configuration api_version=(1, 0, 0) to skip auto check_version requests on startup
[2020-10-11 08:21:16,366] {{logging_mixin.py:112}} INFO - [2020-10-11 08:21:16,366] {{main.py:20}} INFO - {'http': 'http://181.129.70.82:46752', 'https': 'http://181.129.70.82:46752'}
[2020-10-11 08:21:46,395] {{logging_mixin.py:112}} INFO - [2020-10-11 08:21:46,394] {{web_parser.py:34}} INFO - Error occurred: HTTPSConnectionPool(host='www.101greatgoals.com', port=443): Max retries exceeded with url: /feed/ (Caused by ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection object at 0x7f7373c37190>, 'Connection to 181.129.70.82 timed out. (connect timeout=30)'))
[2020-10-11 08:21:46,395] {{logging_mixin.py:112}} INFO - [2020-10-11 08:21:46,395] {{kafka.py:471}} INFO - Closing the Kafka producer with 9223372036.0 secs timeout.
[2020-10-11 08:21:46,396] {{logging_mixin.py:112}} INFO - [2020-10-11 08:21:46,396] {{conn.py:916}} INFO - <BrokerConnection node_id=bootstrap-0 host=kafka:9092 <connected> [IPv4 ('172.19.0.10', 9092)]>: Closing connection. 
[2020-10-11 08:21:46,397] {{logging_mixin.py:112}} INFO - [2020-10-11 08:21:46,396] {{retry_on_exception.py:20}} INFO - Error occured: Not a valid XML document
[2020-10-11 08:21:46,397] {{logging_mixin.py:112}} INFO - [2020-10-11 08:21:46,397] {{retry_on_exception.py:29}} INFO - Retries: 4
[2020-10-11 08:21:46,399] {{logging_mixin.py:112}} INFO - [2020-10-11 08:21:46,399] {{conn.py:378}} INFO - <BrokerConnection node_id=bootstrap-0 host=kafka:9092 <connecting> [IPv4 ('172.19.0.10', 9092)]>: connecting to kafka:9092 [('172.19.0.10', 9092) IPv4]
[2020-10-11 08:21:46,400] {{logging_mixin.py:112}} INFO - [2020-10-11 08:21:46,399] {{conn.py:1195}} INFO - Probing node bootstrap-0 broker version
[2020-10-11 08:21:46,400] {{logging_mixin.py:112}} INFO - [2020-10-11 08:21:46,400] {{conn.py:407}} INFO - <BrokerConnection node_id=bootstrap-0 host=kafka:9092 <connecting> [IPv4 ('172.19.0.10', 9092)]>: Connection complete.
[2020-10-11 08:21:46,505] {{logging_mixin.py:112}} INFO - [2020-10-11 08:21:46,505] {{conn.py:1257}} INFO - Broker version identified as 1.0.0
[2020-10-11 08:21:46,505] {{logging_mixin.py:112}} INFO - [2020-10-11 08:21:46,505] {{conn.py:1259}} INFO - Set configuration api_version=(1, 0, 0) to skip auto check_version requests on startup
[2020-10-11 08:21:46,513] {{logging_mixin.py:112}} INFO - [2020-10-11 08:21:46,513] {{main.py:20}} INFO - {'http': 'http://185.74.4.47:8080', 'https': 'http://185.74.4.47:8080'}
[2020-10-11 08:21:46,743] {{logging_mixin.py:112}} INFO - [2020-10-11 08:21:46,743] {{web_parser.py:34}} INFO - Error occurred: HTTPSConnectionPool(host='www.101greatgoals.com', port=443): Max retries exceeded with url: /feed/ (Caused by ProxyError('Cannot connect to proxy.', ConnectionResetError(104, 'Connection reset by peer')))
[2020-10-11 08:21:46,744] {{logging_mixin.py:112}} INFO - [2020-10-11 08:21:46,743] {{kafka.py:471}} INFO - Closing the Kafka producer with 9223372036.0 secs timeout.
[2020-10-11 08:21:46,744] {{logging_mixin.py:112}} INFO - [2020-10-11 08:21:46,744] {{conn.py:916}} INFO - <BrokerConnection node_id=bootstrap-0 host=kafka:9092 <connected> [IPv4 ('172.19.0.10', 9092)]>: Closing connection. 
[2020-10-11 08:21:46,745] {{logging_mixin.py:112}} INFO - [2020-10-11 08:21:46,745] {{retry_on_exception.py:20}} INFO - Error occured: Not a valid XML document
[2020-10-11 08:21:46,745] {{logging_mixin.py:112}} INFO - [2020-10-11 08:21:46,745] {{retry_on_exception.py:29}} INFO - Retries: 3
[2020-10-11 08:21:46,748] {{logging_mixin.py:112}} INFO - [2020-10-11 08:21:46,748] {{conn.py:378}} INFO - <BrokerConnection node_id=bootstrap-0 host=kafka:9092 <connecting> [IPv4 ('172.19.0.10', 9092)]>: connecting to kafka:9092 [('172.19.0.10', 9092) IPv4]
[2020-10-11 08:21:46,748] {{logging_mixin.py:112}} INFO - [2020-10-11 08:21:46,748] {{conn.py:1195}} INFO - Probing node bootstrap-0 broker version
[2020-10-11 08:21:46,749] {{logging_mixin.py:112}} INFO - [2020-10-11 08:21:46,749] {{conn.py:407}} INFO - <BrokerConnection node_id=bootstrap-0 host=kafka:9092 <connecting> [IPv4 ('172.19.0.10', 9092)]>: Connection complete.
[2020-10-11 08:21:46,854] {{logging_mixin.py:112}} INFO - [2020-10-11 08:21:46,854] {{conn.py:1257}} INFO - Broker version identified as 1.0.0
[2020-10-11 08:21:46,854] {{logging_mixin.py:112}} INFO - [2020-10-11 08:21:46,854] {{conn.py:1259}} INFO - Set configuration api_version=(1, 0, 0) to skip auto check_version requests on startup
[2020-10-11 08:21:46,856] {{logging_mixin.py:112}} INFO - [2020-10-11 08:21:46,856] {{kafka.py:461}} INFO - Kafka producer closed
[2020-10-11 08:21:46,859] {{logging_mixin.py:112}} INFO - [2020-10-11 08:21:46,858] {{main.py:20}} INFO - {'http': 'http://165.22.36.75:8888', 'https': 'http://165.22.36.75:8888'}
[2020-10-11 08:21:47,811] {{logging_mixin.py:112}} INFO - [2020-10-11 08:21:47,811] {{web_parser.py:32}} INFO - Bad response
[2020-10-11 08:21:47,812] {{logging_mixin.py:112}} INFO - [2020-10-11 08:21:47,812] {{kafka.py:471}} INFO - Closing the Kafka producer with 9223372036.0 secs timeout.
[2020-10-11 08:21:47,812] {{logging_mixin.py:112}} INFO - [2020-10-11 08:21:47,812] {{conn.py:916}} INFO - <BrokerConnection node_id=bootstrap-0 host=kafka:9092 <connected> [IPv4 ('172.19.0.10', 9092)]>: Closing connection. 
[2020-10-11 08:21:47,813] {{logging_mixin.py:112}} INFO - [2020-10-11 08:21:47,813] {{retry_on_exception.py:20}} INFO - Error occured: Not a valid XML document
[2020-10-11 08:21:47,813] {{logging_mixin.py:112}} INFO - [2020-10-11 08:21:47,813] {{retry_on_exception.py:29}} INFO - Retries: 2
[2020-10-11 08:21:47,816] {{logging_mixin.py:112}} INFO - [2020-10-11 08:21:47,816] {{conn.py:378}} INFO - <BrokerConnection node_id=bootstrap-0 host=kafka:9092 <connecting> [IPv4 ('172.19.0.10', 9092)]>: connecting to kafka:9092 [('172.19.0.10', 9092) IPv4]
[2020-10-11 08:21:47,817] {{logging_mixin.py:112}} INFO - [2020-10-11 08:21:47,816] {{conn.py:1195}} INFO - Probing node bootstrap-0 broker version
[2020-10-11 08:21:47,817] {{logging_mixin.py:112}} INFO - [2020-10-11 08:21:47,817] {{conn.py:407}} INFO - <BrokerConnection node_id=bootstrap-0 host=kafka:9092 <connecting> [IPv4 ('172.19.0.10', 9092)]>: Connection complete.
[2020-10-11 08:21:47,923] {{logging_mixin.py:112}} INFO - [2020-10-11 08:21:47,923] {{conn.py:1257}} INFO - Broker version identified as 1.0.0
[2020-10-11 08:21:47,923] {{logging_mixin.py:112}} INFO - [2020-10-11 08:21:47,923] {{conn.py:1259}} INFO - Set configuration api_version=(1, 0, 0) to skip auto check_version requests on startup
[2020-10-11 08:21:47,927] {{logging_mixin.py:112}} INFO - [2020-10-11 08:21:47,927] {{main.py:20}} INFO - {'http': 'http://139.5.71.199:8080', 'https': 'http://139.5.71.199:8080'}
[2020-10-11 08:22:17,949] {{logging_mixin.py:112}} INFO - [2020-10-11 08:22:17,949] {{web_parser.py:34}} INFO - Error occurred: HTTPSConnectionPool(host='www.101greatgoals.com', port=443): Max retries exceeded with url: /feed/ (Caused by ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection object at 0x7f7373a8ae50>, 'Connection to 139.5.71.199 timed out. (connect timeout=30)'))
[2020-10-11 08:22:17,950] {{logging_mixin.py:112}} INFO - [2020-10-11 08:22:17,950] {{kafka.py:471}} INFO - Closing the Kafka producer with 9223372036.0 secs timeout.
[2020-10-11 08:22:17,951] {{logging_mixin.py:112}} INFO - [2020-10-11 08:22:17,950] {{conn.py:916}} INFO - <BrokerConnection node_id=bootstrap-0 host=kafka:9092 <connected> [IPv4 ('172.19.0.10', 9092)]>: Closing connection. 
[2020-10-11 08:22:17,951] {{logging_mixin.py:112}} INFO - [2020-10-11 08:22:17,951] {{retry_on_exception.py:20}} INFO - Error occured: Not a valid XML document
[2020-10-11 08:22:17,952] {{logging_mixin.py:112}} INFO - [2020-10-11 08:22:17,951] {{retry_on_exception.py:29}} INFO - Retries: 1
[2020-10-11 08:22:17,954] {{logging_mixin.py:112}} INFO - [2020-10-11 08:22:17,954] {{conn.py:378}} INFO - <BrokerConnection node_id=bootstrap-0 host=kafka:9092 <connecting> [IPv4 ('172.19.0.10', 9092)]>: connecting to kafka:9092 [('172.19.0.10', 9092) IPv4]
[2020-10-11 08:22:17,955] {{logging_mixin.py:112}} INFO - [2020-10-11 08:22:17,955] {{conn.py:1195}} INFO - Probing node bootstrap-0 broker version
[2020-10-11 08:22:17,956] {{logging_mixin.py:112}} INFO - [2020-10-11 08:22:17,955] {{conn.py:407}} INFO - <BrokerConnection node_id=bootstrap-0 host=kafka:9092 <connecting> [IPv4 ('172.19.0.10', 9092)]>: Connection complete.
[2020-10-11 08:22:18,061] {{logging_mixin.py:112}} INFO - [2020-10-11 08:22:18,060] {{conn.py:1257}} INFO - Broker version identified as 1.0.0
[2020-10-11 08:22:18,061] {{logging_mixin.py:112}} INFO - [2020-10-11 08:22:18,061] {{conn.py:1259}} INFO - Set configuration api_version=(1, 0, 0) to skip auto check_version requests on startup
[2020-10-11 08:22:18,065] {{logging_mixin.py:112}} INFO - [2020-10-11 08:22:18,065] {{main.py:20}} INFO - {'http': 'http://185.74.4.47:8080', 'https': 'http://185.74.4.47:8080'}
[2020-10-11 08:22:18,302] {{logging_mixin.py:112}} INFO - [2020-10-11 08:22:18,302] {{web_parser.py:34}} INFO - Error occurred: HTTPSConnectionPool(host='www.101greatgoals.com', port=443): Max retries exceeded with url: /feed/ (Caused by ProxyError('Cannot connect to proxy.', ConnectionResetError(104, 'Connection reset by peer')))
[2020-10-11 08:22:18,302] {{logging_mixin.py:112}} INFO - [2020-10-11 08:22:18,302] {{kafka.py:471}} INFO - Closing the Kafka producer with 9223372036.0 secs timeout.
[2020-10-11 08:22:18,303] {{logging_mixin.py:112}} INFO - [2020-10-11 08:22:18,303] {{conn.py:916}} INFO - <BrokerConnection node_id=bootstrap-0 host=kafka:9092 <connected> [IPv4 ('172.19.0.10', 9092)]>: Closing connection. 
[2020-10-11 08:22:18,304] {{logging_mixin.py:112}} INFO - [2020-10-11 08:22:18,304] {{retry_on_exception.py:20}} INFO - Error occured: Not a valid XML document
[2020-10-11 08:22:18,304] {{taskinstance.py:1128}} ERROR - Not a valid XML document
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/xml/etree/ElementTree.py", line 1637, in close
    self.parser.Parse("", 1) # end of data
xml.parsers.expat.ExpatError: no element found: line 1, column 0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/airflow/.local/lib/python3.7/site-packages/atoma/utils.py", line 33, in parse_xml
    return defused_xml_parse(xml_content)
  File "/usr/local/lib/python3.7/site-packages/defusedxml/common.py", line 105, in parse
    return _parse(source, parser)
  File "/usr/local/lib/python3.7/xml/etree/ElementTree.py", line 1197, in parse
    tree.parse(source, parser)
  File "/usr/local/lib/python3.7/xml/etree/ElementTree.py", line 605, in parse
    self._root = parser.close()
  File "/usr/local/lib/python3.7/xml/etree/ElementTree.py", line 1639, in close
    self._raiseerror(v)
  File "/usr/local/lib/python3.7/xml/etree/ElementTree.py", line 1531, in _raiseerror
    raise err
  File "<string>", line None
xml.etree.ElementTree.ParseError: no element found: line 1, column 0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 966, in _run_raw_task
    result = task_copy.execute(context=context)
  File "/usr/local/lib/python3.7/site-packages/airflow/operators/python_operator.py", line 113, in execute
    return_value = self.execute_callable()
  File "/usr/local/lib/python3.7/site-packages/airflow/operators/python_operator.py", line 118, in execute_callable
    return self.python_callable(*self.op_args, **self.op_kwargs)
  File "/usr/local/airflow/modules/retry/retry_on_exception.py", line 22, in wrapper
    self._raise_on_condition(self._retries, err)
  File "/usr/local/airflow/modules/retry/retry_on_exception.py", line 27, in _raise_on_condition
    raise exception
  File "/usr/local/airflow/modules/retry/retry_on_exception.py", line 17, in wrapper
    return function(*args, **kwargs)
  File "/usr/local/airflow/modules/rss_news/main.py", line 21, in export_news_to_broker
    for news in NewsProducer(rss_feed).get_news_stream(proxy):
  File "/usr/local/airflow/modules/rss_news/rss_news_producer.py", line 34, in get_news_stream
    news_feed_items = self._extract_news_feed_items(proxies)
  File "/usr/local/airflow/modules/rss_news/rss_news_producer.py", line 30, in _extract_news_feed_items
    news_feed = atoma.parse_rss_bytes(content)
  File "/usr/local/airflow/.local/lib/python3.7/site-packages/atoma/rss.py", line 217, in parse_rss_bytes
    root = parse_xml(BytesIO(data)).getroot()
  File "/usr/local/airflow/.local/lib/python3.7/site-packages/atoma/utils.py", line 35, in parse_xml
    raise FeedXMLError('Not a valid XML document')
atoma.exceptions.FeedXMLError: Not a valid XML document
[2020-10-11 08:22:18,307] {{taskinstance.py:1185}} INFO - Marking task as FAILED.dag_id=rss_news_dag, task_id=exporting_101greatgoals_news_to_broker, execution_date=20201011T081000, start_date=20201011T082115, end_date=20201011T082218
[2020-10-11 08:22:21,351] {{logging_mixin.py:112}} INFO - [2020-10-11 08:22:21,351] {{local_task_job.py:103}} INFO - Task exited with return code 1

Consider running a proxy health check (for example, a GET to google.com) before export_news_to_broker, so that a task failure indicates an error in the exporter and not a lack of working proxies in the pool.
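
One way the suggested health check could look, as a minimal sketch using only the Python standard library. The function names `probe_proxy` and `first_healthy_proxy`, the probe URL, and the 5-second timeout are all assumptions for illustration and do not come from the project's code.

```python
import http.client
import urllib.request
from typing import Callable, Iterable, Optional


def probe_proxy(proxy_url: str,
                test_url: str = "https://www.google.com",
                timeout: float = 5.0) -> bool:
    """Return True if a GET to test_url succeeds through the proxy."""
    handler = urllib.request.ProxyHandler(
        {"http": proxy_url, "https": proxy_url}
    )
    opener = urllib.request.build_opener(handler)
    try:
        with opener.open(test_url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (OSError, http.client.HTTPException):
        # Timeouts, connection resets and HTTP errors all mean "unhealthy".
        return False


def first_healthy_proxy(candidates: Iterable[str],
                        probe: Callable[[str], bool] = probe_proxy
                        ) -> Optional[str]:
    """Return the first candidate that passes the probe, or None."""
    for proxy_url in candidates:
        if probe(proxy_url):
            return proxy_url
    return None
```

With a check like this in front of the exporter, a run with no healthy proxy could be reported as a proxy-pool problem instead of surfacing as "Not a valid XML document". The `probe` parameter is injectable so the selection logic can be tested without network access.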
