Comments (13)
The problem was not with the Hadoop cluster but with the default client parameters used when writing.
The parameters "dfs.client.use.datanode.hostname" and "dfs.datanode.use.datanode.hostname" must be forced to true when writing from an external program.
Otherwise the external client uses the internal IP of the datanodes instead of their hostnames.
Thank you for your help.
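For readers using PySpark, a minimal sketch of forcing those two parameters on the client side: the helper below just builds the settings as a dict (the "spark.hadoop." prefix is how Spark forwards options to the underlying Hadoop Configuration; the helper name is mine, not from the original comment).

```python
# Sketch: Hadoop client options an external writer needs when the
# datanodes sit behind Docker/NAT. This only builds the settings;
# pass them to SparkSession.builder.config(...) in a real client.
def external_hdfs_client_conf():
    # The "spark.hadoop." prefix forwards each option to the Hadoop
    # Configuration used by the HDFS client inside Spark.
    return {
        "spark.hadoop.dfs.client.use.datanode.hostname": "true",
        "spark.hadoop.dfs.datanode.use.datanode.hostname": "true",
    }

conf = external_hdfs_client_conf()
print(conf["spark.hadoop.dfs.client.use.datanode.hostname"])  # -> true
```

Usage would be something like looping `builder = builder.config(k, v)` over the dict before calling `getOrCreate()`.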
from docker-hadoop.
@MrHurt
I ran into the same situation: if you use a virtual machine, you need to expose the datanode's address ports.
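In docker-compose.yaml, exposing the datanode ports could look like the fragment below (a sketch; 9864 is the datanode HTTP port and 9866 the data transfer port in Hadoop 3, and the image tag is the one used elsewhere in this thread):

```yaml
datanode:
  image: bde2020/hadoop-datanode:2.0.0-hadoop3.2.1-java8
  ports:
    - "9864:9864"   # datanode HTTP (WebHDFS redirects land here)
    - "9866:9866"   # datanode data transfer
```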
from docker-hadoop.
Thank you so much for your reply.
The solution of opening port 9864 works when you write from the server hosting the Hadoop cluster. Cool, thank you!
But when I try to write from an external server, it doesn't work and I get the same error.
Do you have any idea?
from docker-hadoop.
If you really want to fix this problem, debug the code and you will find the answer:
when the request reaches the host, it cannot resolve the container IP, so you need to add the host information to /etc/hosts.
from docker-hadoop.
@bde27 aren't these two parameters already forced to true in the entrypoint.sh?
...
if [ "$MULTIHOMED_NETWORK" = "1" ]; then
    echo "Configuring for multihomed network"

    # HDFS
    addProperty /etc/hadoop/hdfs-site.xml dfs.namenode.rpc-bind-host 0.0.0.0
    addProperty /etc/hadoop/hdfs-site.xml dfs.namenode.servicerpc-bind-host 0.0.0.0
    addProperty /etc/hadoop/hdfs-site.xml dfs.namenode.http-bind-host 0.0.0.0
    addProperty /etc/hadoop/hdfs-site.xml dfs.namenode.https-bind-host 0.0.0.0
    addProperty /etc/hadoop/hdfs-site.xml dfs.client.use.datanode.hostname true
    addProperty /etc/hadoop/hdfs-site.xml dfs.datanode.use.datanode.hostname true
...
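One way to confirm what addProperty actually wrote server-side is to parse the generated hdfs-site.xml. A sketch using a sample document (in a running container you would read /etc/hadoop/hdfs-site.xml instead); note these are the server-side values, and the client still needs its own configuration:

```python
import xml.etree.ElementTree as ET

# Sample of what entrypoint.sh's addProperty produces; replace with the
# contents of /etc/hadoop/hdfs-site.xml from the namenode container.
sample = """<configuration>
  <property>
    <name>dfs.client.use.datanode.hostname</name>
    <value>true</value>
  </property>
  <property>
    <name>dfs.datanode.use.datanode.hostname</name>
    <value>true</value>
  </property>
</configuration>"""

def site_properties(xml_text):
    """Return {name: value} for every <property> in a Hadoop *-site.xml."""
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): p.findtext("value")
            for p in root.findall("property")}

props = site_properties(sample)
print(props["dfs.client.use.datanode.hostname"])  # -> true
```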
I'm facing the same issue when trying to write or read files from PySpark on an external client. Could you please elaborate on what you did to solve the problem?
from docker-hadoop.
Forcing "dfs.client.use.datanode.hostname" and "dfs.datanode.use.datanode.hostname" to true is not working for me with a Hadoop cluster set up via Docker.
from docker-hadoop.
Hey guys,
Has anyone actually found a fix for this or solved it? I'm facing the same problem with my Hadoop Docker setup. I ran a simple wordcount test to see if everything is working fine, and it is, but as soon as I have Spark Streaming writing into it, HDFS doesn't seem to pick the files up at all.
`2020-12-07 09:20:58.212 WARN 1 --- [ool-22-thread-1] o.a.spark.streaming.CheckpointWriter : Could not write checkpoint for time 1607332854000 ms to file 'hdfs://namenode:8020/dangerousgoods/checkpoint/checkpoint-1607332858000'
2020-12-07 09:20:58.213 INFO 1 --- [uler-event-loop] o.a.spark.storage.memory.MemoryStore : Block broadcast_18 stored as values in memory (estimated size 17.2 KB, free 9.2 GB)
2020-12-07 09:20:58.214 INFO 1 --- [uler-event-loop] o.a.spark.storage.memory.MemoryStore : Block broadcast_18_piece0 stored as bytes in memory (estimated size 7.4 KB, free 9.2 GB)
2020-12-07 09:20:58.214 INFO 1 --- [er-event-loop-8] o.apache.spark.storage.BlockManagerInfo : Added broadcast_18_piece0 in memory on 16b1f170f11c:42679 (size: 7.4 KB, free: 9.2 GB)
2020-12-07 09:20:58.215 INFO 1 --- [uler-event-loop] org.apache.spark.SparkContext : Created broadcast 18 from broadcast at DAGScheduler.scala:1163
2020-12-07 09:20:58.215 INFO 1 --- [uler-event-loop] org.apache.spark.scheduler.DAGScheduler : Submitting 1 missing tasks from ShuffleMapStage 53 (MapPartitionsRDD[28] at mapToPair at RealtimeProcessor.java:256) (first 15 tasks are for partitions Vector(0))
2020-12-07 09:20:58.215 INFO 1 --- [uler-event-loop] o.a.spark.scheduler.TaskSchedulerImpl : Adding task set 53.0 with 1 tasks
2020-12-07 09:20:58.216 INFO 1 --- [er-event-loop-7] o.apache.spark.scheduler.TaskSetManager : Starting task 0.0 in stage 53.0 (TID 19, 10.0.9.185, executor 0, partition 0, PROCESS_LOCAL, 7760 bytes)
2020-12-07 09:20:58.221 INFO 1 --- [r-event-loop-10] o.apache.spark.storage.BlockManagerInfo : Added broadcast_18_piece0 in memory on 10.0.9.185:38567 (size: 7.4 KB, free: 366.2 MB)
2020-12-07 09:20:58.225 INFO 1 --- [result-getter-0] o.apache.spark.scheduler.TaskSetManager : Finished task 0.0 in stage 53.0 (TID 19) in 9 ms on 10.0.9.185 (executor 0) (1/1)
2020-12-07 09:20:58.225 INFO 1 --- [result-getter-0] o.a.spark.scheduler.TaskSchedulerImpl : Removed TaskSet 53.0, whose tasks have all completed, from pool
2020-12-07 09:20:58.226 INFO 1 --- [uler-event-loop] org.apache.spark.scheduler.DAGScheduler : ShuffleMapStage 53 (mapToPair at RealtimeProcessor.java:256) finished in 0.014 s
2020-12-07 09:20:58.226 INFO 1 --- [uler-event-loop] org.apache.spark.scheduler.DAGScheduler : looking for newly runnable stages
2020-12-07 09:20:58.226 INFO 1 --- [uler-event-loop] org.apache.spark.scheduler.DAGScheduler : running: Set()
2020-12-07 09:20:58.226 INFO 1 --- [uler-event-loop] org.apache.spark.scheduler.DAGScheduler : waiting: Set(ResultStage 55)
2020-12-07 09:20:58.226 INFO 1 --- [uler-event-loop] org.apache.spark.scheduler.DAGScheduler : failed: Set()
2020-12-07 09:20:58.227 INFO 1 --- [uler-event-loop] org.apache.spark.scheduler.DAGScheduler : Submitting ResultStage 55 (MapPartitionsRDD[33] at map at RealtimeProcessor.java:264), which has no missing parents
2020-12-07 09:20:58.227 INFO 1 --- [uler-event-loop] o.a.spark.storage.memory.MemoryStore : Block broadcast_19 stored as values in memory (estimated size 8.8 KB, free 9.2 GB)
2020-12-07 09:20:58.228 INFO 1 --- [uler-event-loop] o.a.spark.storage.memory.MemoryStore : Block broadcast_19_piece0 stored as bytes in memory (estimated size 4.4 KB, free 9.2 GB)
2020-12-07 09:20:58.229 INFO 1 --- [er-event-loop-0] o.apache.spark.storage.BlockManagerInfo : Added broadcast_19_piece0 in memory on 16b1f170f11c:42679 (size: 4.4 KB, free: 9.2 GB)
2020-12-07 09:20:58.229 INFO 1 --- [uler-event-loop] org.apache.spark.SparkContext : Created broadcast 19 from broadcast at DAGScheduler.scala:1163
2020-12-07 09:20:58.229 INFO 1 --- [uler-event-loop] org.apache.spark.scheduler.DAGScheduler : Submitting 1 missing tasks from ResultStage 55 (MapPartitionsRDD[33] at map at RealtimeProcessor.java:264) (first 15 tasks are for partitions Vector(0))`
That is the first error that appears, and after a few seconds I get the exact same error as in this post's title.
from docker-hadoop.
Hi, has anyone been able to fix this issue?
I'm using the Docker Hub image.
The parameters "dfs.client.use.datanode.hostname" and "dfs.datanode.use.datanode.hostname" are both true in my hdfs-site.xml, but I still have the problem.
from docker-hadoop.
Any update on this issue?
from docker-hadoop.
I encounter the same situation when I deploy the docker-hadoop image in a k8s cluster.
from docker-hadoop.
I encountered the same situation when deploying the docker-hadoop image in a k8s cluster.
I just found out that the issue on my end was the connectivity between the datanode and my local Python app; after deploying my app on the same Docker network as Hadoop, it was solved.
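Attaching the client container to the existing Hadoop network could look like the compose fragment below (a sketch: the service and image names are placeholders, and the network name docker-hadoop_default is an assumption; check yours with `docker network ls`):

```yaml
services:
  client-app:                      # hypothetical client container
    image: my-python-app           # placeholder image name
    networks:
      - hadoop
networks:
  hadoop:
    external: true
    name: docker-hadoop_default    # assumed name of the Hadoop compose network
```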
from docker-hadoop.
If you want to write from external hosts:
- Open datanode container ports 9864 and 9866, then recreate the containers.
- Add
  ${host ip} datanode namenode ${namenode container id} ${datanode container id}
  to your local hosts file.
- Set the client configuration parameter dfs.client.use.datanode.hostname to true:

Configuration().apply {
    this.set("dfs.client.use.datanode.hostname", "true")
}

If you do not want to add ${namenode container id} ${datanode container id} to your local hosts file, you can set the datanode container hostname to datanode and the namenode container hostname to namenode with the hostname instruction in the container configuration in docker-compose.yaml:
...
namenode:
  image: bde2020/hadoop-namenode:2.0.0-hadoop3.2.1-java8
  hostname: namenode
...
datanode:
  image: bde2020/hadoop-datanode:2.0.0-hadoop3.2.1-java8
  hostname: datanode
...
from docker-hadoop.