Comments (15)
Hi @Never-D
2024-03-19 11:19:15,682 WARN ALLUXIO-PROXY-WEB-SERVICE-157 - Failed to read block 21508390913 of file /shein-os/cos-alluxio-test/data/upload-test/2/nexus-test/aws-sdk-cpp-v1.0.tar.gz from worker WorkerNetAddress{host=10.121.0.207, containerHost=, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=, tieredIdentity=TieredIdentity(node=10.121.0.207, rack=null)}. This worker will be skipped for future read operations, will retry: alluxio.exception.status.UnavailableException: io exception.
Can you check the alluxio-worker log? This is generally caused by a shortage of direct memory on the alluxio-worker side during concurrent reads. Increasing the value of -XX:MaxDirectMemorySize for the alluxio-worker process may help.
For the IOException: Broken pipe exception, one idea is to catch it and retry on your HTTP service side.
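A minimal sketch of the catch-and-retry idea, in plain Java with no Alluxio dependency. The class name RetryingRead, the method withRetry, and the linear backoff are illustrative assumptions, not part of Alluxio or of the code in this thread. A retry like this only helps before any response bytes have been written to the HTTP client; once the client-facing socket reports a broken pipe, the client usually has to issue the request again.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

/** Hypothetical helper: retries a task that may fail with an IOException. */
public final class RetryingRead {
    private RetryingRead() {}

    /**
     * Runs the task and retries on IOException, up to maxAttempts in total.
     * Sleeps backoffMillis * attempt between attempts (linear backoff).
     * Rethrows the last IOException if every attempt fails.
     */
    public static <T> T withRetry(Callable<T> task, int maxAttempts, long backoffMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (IOException e) {
                last = e; // remember the failure and try again
                if (attempt < maxAttempts) {
                    Thread.sleep(backoffMillis * attempt);
                }
            }
        }
        throw last;
    }
}
```

In a service like the one discussed here, the task would wrap the part of the request that opens and reads the file, for example RetryingRead.withRetry(() -> readWholeFile(uri), 3, 200L), where readWholeFile is a placeholder for your own read logic.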
from alluxio.
Could you show your alluxio configuration? Then I can find whether it is just a configuration problem.
from alluxio.
@YichuanSun
Master and worker node specs: 4 cores, 32 GB memory
master configuration:
alluxio.master.hostname=${localip}
alluxio.master.embedded.journal.addresses=${alluxio_master_ip01}:19200,${alluxio_master_ip02}:19200,${alluxio_master_ip03}:19200
alluxio.master.mount.table.root.ufs=cos://lt-cubesats-alluxio-prod/alluxio/
fs.cos.access.key=${cos_cubesats_alluxio_accessKeyId}
fs.cos.app.id=1259571579
fs.cos.connection.max=4096
fs.cos.connection.timeout=50sec
fs.cos.region=ap-nanjing
fs.cos.secret.key=${cos_cubesats_alluxio_secretKey}
fs.cos.socket.timeout=50sec
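# Enable automatic cache loading and configure the cache directory: files in this directory are cached immediately after upload or after being discovered in object storage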
alluxio.master.data.async.cache.enabled=true
alluxio.master.data.async.cache.file.path=/shein-os/cos-alluxio/data
alluxio.user.file.replication.durable=2
alluxio.master.worker.timeout=180sec
# Metadata refresh interval
alluxio.user.file.metadata.sync.interval=30min
# Metadata management
alluxio.master.metastore.dir=/data01/metastore
alluxio.master.journal.folder=/data01/journal
alluxio.security.authorization.permission.enabled=false
# User impersonation
alluxio.master.security.impersonation.hadoop.users=*
alluxio.master.security.impersonation.hadoop.groups=*
alluxio.master.security.impersonation.client.users=*
alluxio.master.security.impersonation.client.groups=*
alluxio.master.security.impersonation.yarn.users=*
alluxio.master.security.impersonation.yarn.groups=*
# Disable local caching of remote storage data on Alluxio
#alluxio.user.file.passive.cache.enabled=false
alluxio.user.file.writetype.default=THROUGH
alluxio.user.file.readtype.default=CACHE
# Work around empty zero-byte files being unexpectedly added to the mounted bucket
alluxio.underfs.object.store.breadcrumbs.enabled=false
# FUSE monitoring configuration
alluxio.fuse.web.enabled=true
alluxio.user.ufs.block.read.location.policy=alluxio.client.block.policy.CapacityBasedDeterministicHashPolicy
alluxio.user.client.cache.enabled=true
alluxio.user.client.cache.store.type=LOCAL
alluxio.user.client.cache.dirs=/home/hadoop
alluxio.user.client.cache.size=10GB
alluxio.user.client.cache.page.size=4MB
alluxio.master.shell.copy.file.buffer.size=8388608
alluxio.underfs.object.store.breadcrumbs.enabled=false
alluxio.user.network.writer.chunk.size.bytes=4MB
alluxio.user.client.cache.async.write.threads=32
alluxio.user.client.cache.timeout.threads=64
alluxio.user.client.cache.timeout.duration=30min
alluxio.user.network.reader.chunk.size.bytes=4MB
alluxio.user.streaming.reader.chunk.size.bytes=2MB
alluxio.user.streaming.reader.chunk.size.bytes=2MB
#alluxio.user.block.size.bytes.default=16MB
# Increase the number of worker web handler threads
alluxio.web.threads=4000
alluxio.network.connection.health.check.timeout.ms=180sec
alluxio.web.threaddump.log.enabled=true
alluxio.master.rpc.executor.max.pool.size=4000
alluxio.master.rpc.executor.core.pool.size=4000
alluxio.user.network.data.timeout.ms=30min
alluxio.user.streaming.data.timeout=30min
worker configuration:
alluxio.master.embedded.journal.addresses=${alluxio_master_ip01}:19200,${alluxio_master_ip02}:19200,${alluxio_master_ip03}:19200
# Cache configuration: enable two-tier caching
alluxio.worker.tieredstore.levels=2
alluxio.worker.tieredstore.level0.alias=MEM
alluxio.worker.tieredstore.level0.dirs.path=/mnt/ramdisk
alluxio.worker.tieredstore.level0.dirs.mediumtype=MEM
alluxio.worker.tieredstore.level0.dirs.quota=1GB
alluxio.worker.tieredstore.level1.alias=HDD
alluxio.worker.tieredstore.level1.dirs.path=/data01/alluxio-namespace
alluxio.worker.tieredstore.level1.dirs.mediumtype=HDD
alluxio.worker.tieredstore.level1.dirs.quota=700GB
# Disable local caching of remote storage data on Alluxio
alluxio.user.file.passive.cache.enabled=false
alluxio.user.file.writetype.default=THROUGH
alluxio.user.file.readtype.default=CACHE
alluxio.worker.tieredstore.level0.watermark.high.ratio=0.70
alluxio.worker.tieredstore.level1.watermark.high.ratio=0.70
alluxio.security.authorization.permission.enabled=false
alluxio.network.ip.address.used=true
# Work around empty zero-byte files being unexpectedly added to the mounted bucket
alluxio.underfs.object.store.breadcrumbs.enabled=false
alluxio.master.shell.copy.file.buffer.size=8388608
alluxio.underfs.object.store.breadcrumbs.enabled=false
alluxio.user.network.writer.chunk.size.bytes=4MB
alluxio.user.client.cache.async.write.threads=32
alluxio.user.client.cache.timeout.threads=64
alluxio.user.network.reader.chunk.size.bytes=4MB
alluxio.user.streaming.reader.chunk.size.bytes=2MB
alluxio.user.streaming.reader.chunk.size.bytes=2MB
alluxio.user.streaming.reader.close.timeout=30s
alluxio.consul.enabled=true
alluxio.consul.url=http://xxxx
alluxio.consul.service.name=prod-alluxio-server-ci-east-worker
alluxio.service.env.type=prod
alluxio.consul.service.tag=type=type=worker,disk=ssd,model=m6,cmdb-app-name=ci-alluxio,cmdb-name=ci-alluxio-cneast-prod-main
# Increase the number of worker web handler threads
alluxio.web.threads=4000
alluxio.network.connection.health.check.timeout.ms=180sec
alluxio.web.threaddump.log.enabled=true
alluxio.worker.management.load.detection.cool.down.time=60sec
alluxio.worker.free.space.timeout=180sec
alluxio.worker.master.periodical.rpc.timeout=30min
alluxio.worker.memory.size=21GB
alluxio.worker.network.block.reader.threads.max=4000
alluxio.worker.network.keepalive.time=30min
alluxio.worker.network.keepalive.timeout=30min
alluxio.worker.network.permit.keepalive.time=30min
alluxio.worker.network.netty.worker.threads=8
alluxio.worker.block.master.client.pool.size=30
alluxio.worker.rpc.executor.core.pool.size=4000
alluxio.worker.rpc.executor.max.pool.size=4000
The client is configured with master information only.
In addition, we added an HTTP download interface on the worker node. When only one worker is used for downloading, no problems occur. However, when multiple worker nodes are called to download files at the same time, some of the downloads fail, similar to what happens when using the SDK.
The code for the download interface is as follows:
@GET
@Path(PATH_PARAM)
@ApiOperation(value = "Download the given file at the path", response = java.io.InputStream.class)
@Produces(MediaType.APPLICATION_OCTET_STREAM)
public Response downloadFile(@PathParam("path") final String path) throws IOException, AlluxioException {
    AlluxioURI uri = new AlluxioURI("/" + path);
    FileInStream is;
    URIStatus status;
    try {
        if (!mFsClient.exists(uri)) {
            mFsClient.loadMetadata(uri);
            if (!mFsClient.exists(uri)) {
                return Response.noContent().build();
            }
        }
        // is = mFsClient.openFile(uri);
        status = mFsClient.getStatus(uri);
    } catch (IOException | AlluxioException e) {
        return Response.status(500).entity(e.getMessage()).build();
    }
    StreamingOutput fileStream = output -> {
        try (FileInStream input = mFsClient.openFile(uri)) {
            byte[] buffer = new byte[1024];
            int length;
            while ((length = input.read(buffer)) != -1) {
                output.write(buffer, 0, length);
                output.flush();
            }
        } catch (AlluxioException e) {
            throw new RuntimeException(e);
        }
    };
    try {
        return Response.ok(fileStream)
            .header("Content-Disposition", "attachment; filename=" + uri.getName())
            .header("Content-Length", status.getLength())
            .build();
    } catch (Exception e) {
        return Response.status(500).entity(e.getMessage()).build();
    }
}
from alluxio.
Have you found errors in the master, worker and proxy logs? Can you share the logs?
"if only one worker is used for downloading, no problems are found; however, if multiple worker nodes are called to download files at the same time, some of the files will fail to download, similar to the phenomenon of using SDK."
You can try increasing the alluxio.user.block.master.client.pool.size.max property.
from alluxio.
@jasondrogba The error log is: java.io.IOException: Broken pipe org.apache.catalina.connector.ClientAbortException: java.io.IOException: Broken pipe.
from alluxio.
@jasondrogba Is there any solution to the concurrency issue caused by adding a download interface similar to a proxy node to the worker node?
from alluxio.
@jasondrogba The error log is: java.io.IOException: Broken pipe org.apache.catalina.connector.ClientAbortException: java.io.IOException: Broken pipe.
@Never-D I guess this error message comes from Spring Boot? The timeout configured in Tomcat or nginx may be too short.
https://stackoverflow.com/questions/43825908/org-apache-catalina-connector-clientabortexception-java-io-ioexception-apr-err
Most likely, your server is taking too long to respond and the client is getting bored and closing the connection.
A bit more explanation: tomcat receives a request on a connection and tries to fulfill it. Imagine this takes 3 minutes, now, if the client has a timeout of say 2 minutes, it will close the connection and when tomcat finally comes back to try to write the response, the connection is closed and it throws an org.apache.catalina.connector.ClientAbortException.
I think you can increase the timeout of the Spring Boot server and nginx, or increase the CPU and memory of the Alluxio node.
Please share the Alluxio logs so we can look at what causes the timeout under concurrent load. It is difficult to determine the cause from the error report alone. Have you found any errors in master.log and worker.log under alluxio/logs? If so, please share them.
from alluxio.
@jasondrogba 2024-03-19 11:18:58,227 INFO ALLUXIO-PROXY-WEB-SERVICE-224 - Alluxio S3 API received GET request: URI=http://alluxio-test-proxy.dev.sheincorp.cn/api/v1/paths/%2Fshein-os/cos-alluxio-test/data/upload-test/2/nexus-test/aws-sdk-cpp-v1.0.tar.gz/download-file User=null Media Type=null Query Parameters={} Path Parameters={}
2024-03-19 11:19:15,682 WARN ALLUXIO-PROXY-WEB-SERVICE-157 - Failed to read block 21508390913 of file /shein-os/cos-alluxio-test/data/upload-test/2/nexus-test/aws-sdk-cpp-v1.0.tar.gz from worker WorkerNetAddress{host=10.121.0.207, containerHost=, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=, tieredIdentity=TieredIdentity(node=10.121.0.207, rack=null)}. This worker will be skipped for future read operations, will retry: alluxio.exception.status.UnavailableException: io exception.
from alluxio.
@jasondrogba When downloading a 162 MB file, the downloaded size is incorrect, but no error message is reported.
2024-03-19 12:50:27 (1.60 MB/s) - ‘aws-sdk-cpp-v1.0.tar.gz.184’ saved [58884016]
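The wget line above shows 58884016 bytes saved for a file reported as 162 MB, with no error on either side. One way to make such a truncation visible is to count the bytes actually copied and compare them with the expected length. The sketch below is plain Java; the class name VerifiedCopy, the method copyExactly, and the 64 KB buffer size are assumptions for illustration, not part of Alluxio or of the code in this thread.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

/** Hypothetical helper: copies a stream and fails if the byte count is wrong. */
public final class VerifiedCopy {
    private VerifiedCopy() {}

    /**
     * Copies everything from in to out using a 64 KB buffer, flushes once at
     * the end, and throws if the number of bytes copied differs from
     * expectedLength. Returns the number of bytes copied.
     */
    public static long copyExactly(InputStream in, OutputStream out, long expectedLength)
            throws IOException {
        byte[] buffer = new byte[64 * 1024];
        long total = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
            total += n;
        }
        out.flush();
        if (total != expectedLength) {
            // Surface the short read instead of ending the response silently.
            throw new IOException("Truncated copy: expected " + expectedLength
                    + " bytes but copied " + total);
        }
        return total;
    }
}
```

Inside the StreamingOutput lambda of the download interface, this could be called as VerifiedCopy.copyExactly(input, output, status.getLength()). Because the Content-Length header has already been sent by then, the exception cannot turn the response into a clean error, but it does leave a server-side trace showing where the stream ended early.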
from alluxio.
This worker will be skipped for future read operations
According to this line, I found that the error comes from AlluxioFileInStream.
You can take a look at #16094 and #16096.
export ALLUXIO_FUSE_JAVA_OPTS="-XX:MaxDirectMemorySize=128m"
You can try increasing MaxDirectMemorySize.
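Before raising the limit, it can help to see how much direct memory a JVM is actually using. The sketch below uses the standard java.lang.management API to print the JVM's buffer pools; the class name DirectMemoryReport is an assumption for illustration. It reports only on the JVM it runs in, so inspecting a worker would require running it inside that process or reading the same MXBeans over JMX. It may also miss direct memory that a library allocates outside ByteBuffer.allocateDirect.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;

/** Hypothetical helper: summarizes the buffer pools of the current JVM. */
public final class DirectMemoryReport {
    private DirectMemoryReport() {}

    /** Returns one line per buffer pool (typically "direct" and "mapped"). */
    public static String report() {
        StringBuilder sb = new StringBuilder();
        List<BufferPoolMXBean> pools =
                ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class);
        for (BufferPoolMXBean pool : pools) {
            sb.append(String.format("%s: buffers=%d used=%d bytes capacity=%d bytes%n",
                    pool.getName(), pool.getCount(),
                    pool.getMemoryUsed(), pool.getTotalCapacity()));
        }
        return sb.toString();
    }
}
```

Calling System.out.print(DirectMemoryReport.report()) under load and comparing the "direct" pool against the configured -XX:MaxDirectMemorySize gives a rough indication of whether the limit is being approached.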
@secfree Hi, do you have any idea about this error? I think you have more experience with this and could help with the issue.
from alluxio.
One possible reason: you have to close the FileSystem instance at the end of your code; otherwise these FileSystem objects leak resources. @Never-D
from alluxio.
One possible reason: you have to close the FileSystem instance at the end of your code; otherwise these FileSystem objects leak resources. @Never-D
Especially in such a high-concurrency case.
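The advice about closing the client can be sketched with try-with-resources. The block below deliberately uses a stand-in class, StubClient, instead of Alluxio's FileSystem so that it is self-contained; StubClient, CloseClientSketch, and their methods are hypothetical names. The assumption is that the real client is closeable in the same way.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Illustrates closing a per-call client so that nothing is leaked. */
public final class CloseClientSketch {
    private CloseClientSketch() {}

    /** Hypothetical stand-in for a closeable client; tracks open instances. */
    static final class StubClient implements AutoCloseable {
        static final AtomicInteger OPEN = new AtomicInteger();

        StubClient() {
            OPEN.incrementAndGet();
        }

        String read(String path) {
            return "contents of " + path;
        }

        @Override
        public void close() {
            OPEN.decrementAndGet();
        }
    }

    /** Creates a client, uses it once, and always closes it. */
    public static String readOnce(String path) {
        try (StubClient client = new StubClient()) {
            return client.read(path);
        }
    }

    /** Number of clients created but not yet closed. */
    public static int openClients() {
        return StubClient.OPEN.get();
    }
}
```

If the client is instead a long-lived shared field, as mFsClient appears to be in the download interface above, the equivalent would be to close it once at service shutdown rather than per request.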
from alluxio.
@YichuanSun The error I sent came from the proxy.
from alluxio.
There are no error logs on the worker node.
from alluxio.
When downloading a 162 MB file, the downloaded size is incorrect, but no error message is reported.
2024-03-19 12:50:27 (1.60 MB/s) - ‘aws-sdk-cpp-v1.0.tar.gz.184’ saved [58884016]
from alluxio.