
Comments (15)

secfree commented on June 15, 2024

Hi @Never-D

2024-03-19 11:19:15,682 WARN ALLUXIO-PROXY-WEB-SERVICE-157 - Failed to read block 21508390913 of file /shein-os/cos-alluxio-test/data/upload-test/2/nexus-test/aws-sdk-cpp-v1.0.tar.gz from worker WorkerNetAddress{host=10.121.0.207, containerHost=, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=, tieredIdentity=TieredIdentity(node=10.121.0.207, rack=null)}. This worker will be skipped for future read operations, will retry: alluxio.exception.status.UnavailableException: io exception.

Can you check the log of alluxio-worker? Generally this is caused by a shortage of direct memory on the alluxio-worker side when reading concurrently. Increasing the value of -XX:MaxDirectMemorySize for the alluxio-worker process may help.

For the IOException: Broken pipe exception, one option is to catch it and retry on your HTTP service side.
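For reference, a sketch of how that flag can be passed to the worker through `conf/alluxio-env.sh` (`ALLUXIO_WORKER_JAVA_OPTS` is the worker's JVM-options hook; the 8g value is purely illustrative):

```shell
# conf/alluxio-env.sh on each worker node -- illustrative value, size to your read concurrency
ALLUXIO_WORKER_JAVA_OPTS="${ALLUXIO_WORKER_JAVA_OPTS} -XX:MaxDirectMemorySize=8g"
```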

from alluxio.

YichuanSun commented on June 15, 2024

Could you share your Alluxio configuration? Then I can check whether it is just a configuration problem.

Never-D commented on June 15, 2024

@YichuanSun
Master and worker node specs: 4 cores, 32 GB memory

master configuration:

alluxio.master.hostname=${localip}
alluxio.master.embedded.journal.addresses=${alluxio_master_ip01}:19200,${alluxio_master_ip02}:19200,${alluxio_master_ip03}:19200  
alluxio.master.mount.table.root.ufs=cos://lt-cubesats-alluxio-prod/alluxio/
fs.cos.access.key=${cos_cubesats_alluxio_accessKeyId}
fs.cos.app.id=1259571579
fs.cos.connection.max=4096
fs.cos.connection.timeout=50sec
fs.cos.region=ap-nanjing
fs.cos.secret.key=${cos_cubesats_alluxio_secretKey}
fs.cos.socket.timeout=50sec

# Enable automatic async caching and set the cache directory: data under this path is cached immediately after upload or after discovery in the object store
alluxio.master.data.async.cache.enabled=true
alluxio.master.data.async.cache.file.path=/shein-os/cos-alluxio/data


alluxio.user.file.replication.durable=2
alluxio.master.worker.timeout=180sec
# Metadata sync interval
alluxio.user.file.metadata.sync.interval=30min
# Metadata management
alluxio.master.metastore.dir=/data01/metastore
alluxio.master.journal.folder=/data01/journal
alluxio.security.authorization.permission.enabled=false
# User impersonation
alluxio.master.security.impersonation.hadoop.users=*
alluxio.master.security.impersonation.hadoop.groups=*
alluxio.master.security.impersonation.client.users=*
alluxio.master.security.impersonation.client.groups=*
alluxio.master.security.impersonation.yarn.users=*
alluxio.master.security.impersonation.yarn.groups=*

# Disable local caching of remote data stored in Alluxio
#alluxio.user.file.passive.cache.enabled=false
alluxio.user.file.writetype.default=THROUGH
alluxio.user.file.readtype.default=CACHE

# Work around spurious zero-byte files appearing in the mounted bucket
alluxio.underfs.object.store.breadcrumbs.enabled=false
# FUSE monitoring configuration
alluxio.fuse.web.enabled=true
alluxio.user.ufs.block.read.location.policy=alluxio.client.block.policy.CapacityBasedDeterministicHashPolicy
alluxio.user.client.cache.enabled=true
alluxio.user.client.cache.store.type=LOCAL
alluxio.user.client.cache.dirs=/home/hadoop
alluxio.user.client.cache.size=10GB
alluxio.user.client.cache.page.size=4MB


alluxio.master.shell.copy.file.buffer.size=8388608
alluxio.user.network.writer.chunk.size.bytes=4MB
alluxio.user.client.cache.async.write.threads=32
alluxio.user.client.cache.timeout.threads=64
alluxio.user.client.cache.timeout.duration=30min
alluxio.user.network.reader.chunk.size.bytes=4MB
alluxio.user.streaming.reader.chunk.size.bytes=2MB

#alluxio.user.block.size.bytes.default=16MB

# Increase the number of worker web handler threads
alluxio.web.threads=4000
alluxio.network.connection.health.check.timeout.ms=180sec

alluxio.web.threaddump.log.enabled=true

alluxio.master.rpc.executor.max.pool.size=4000
alluxio.master.rpc.executor.core.pool.size=4000

alluxio.user.network.data.timeout.ms=30min
alluxio.user.streaming.data.timeout=30min

worker configuration:

alluxio.master.embedded.journal.addresses=${alluxio_master_ip01}:19200,${alluxio_master_ip02}:19200,${alluxio_master_ip03}:19200   

# Cache configuration: enable two-level tiered storage
alluxio.worker.tieredstore.levels=2
alluxio.worker.tieredstore.level0.alias=MEM
alluxio.worker.tieredstore.level0.dirs.path=/mnt/ramdisk
alluxio.worker.tieredstore.level0.dirs.mediumtype=MEM
alluxio.worker.tieredstore.level0.dirs.quota=1GB
alluxio.worker.tieredstore.level1.alias=HDD
alluxio.worker.tieredstore.level1.dirs.path=/data01/alluxio-namespace
alluxio.worker.tieredstore.level1.dirs.mediumtype=HDD
alluxio.worker.tieredstore.level1.dirs.quota=700GB

# Disable local caching of remote data stored in Alluxio
alluxio.user.file.passive.cache.enabled=false
alluxio.user.file.writetype.default=THROUGH
alluxio.user.file.readtype.default=CACHE
alluxio.worker.tieredstore.level0.watermark.high.ratio=0.70
alluxio.worker.tieredstore.level1.watermark.high.ratio=0.70


alluxio.security.authorization.permission.enabled=false
alluxio.network.ip.address.used=true

# Work around spurious zero-byte files appearing in the mounted bucket
alluxio.underfs.object.store.breadcrumbs.enabled=false

alluxio.master.shell.copy.file.buffer.size=8388608
alluxio.user.network.writer.chunk.size.bytes=4MB
alluxio.user.client.cache.async.write.threads=32
alluxio.user.client.cache.timeout.threads=64
alluxio.user.network.reader.chunk.size.bytes=4MB
alluxio.user.streaming.reader.chunk.size.bytes=2MB
alluxio.user.streaming.reader.close.timeout=30s

alluxio.consul.enabled=true
alluxio.consul.url=http://xxxx
alluxio.consul.service.name=prod-alluxio-server-ci-east-worker
alluxio.service.env.type=prod
alluxio.consul.service.tag=type=type=worker,disk=ssd,model=m6,cmdb-app-name=ci-alluxio,cmdb-name=ci-alluxio-cneast-prod-main

# Increase the number of worker web handler threads
alluxio.web.threads=4000

alluxio.network.connection.health.check.timeout.ms=180sec

alluxio.web.threaddump.log.enabled=true

alluxio.worker.management.load.detection.cool.down.time=60sec
alluxio.worker.free.space.timeout=180sec
alluxio.worker.master.periodical.rpc.timeout=30min
alluxio.worker.memory.size=21GB
alluxio.worker.network.block.reader.threads.max=4000
alluxio.worker.network.keepalive.time=30min
alluxio.worker.network.keepalive.timeout=30min
alluxio.worker.network.permit.keepalive.time=30min
alluxio.worker.network.netty.worker.threads=8
alluxio.worker.block.master.client.pool.size=30
alluxio.worker.rpc.executor.core.pool.size=4000
alluxio.worker.rpc.executor.max.pool.size=4000

The client is only configured with the master information.
In addition, we found that when providing an HTTP download interface on the worker nodes, downloads through a single worker show no problems; however, when multiple worker nodes serve downloads at the same time, some of the files fail to download, the same phenomenon we see when using the SDK.

The code for the download interface is as follows:

  @GET
  @Path(PATH_PARAM)
  @ApiOperation(value = "Download the given file at the path", response = java.io.InputStream.class)
  @Produces(MediaType.APPLICATION_OCTET_STREAM)
  public Response downloadFile(@PathParam("path") final String path) throws IOException, AlluxioException {
    AlluxioURI uri = new AlluxioURI("/" + path);
    URIStatus status;
    try {
      if (!mFsClient.exists(uri)) {
        mFsClient.loadMetadata(uri);
        if (!mFsClient.exists(uri)) {
          return Response.noContent().build();
        }
      }
      status = mFsClient.getStatus(uri);
    } catch (IOException | AlluxioException e) {
      return Response.status(500).entity(e.getMessage()).build();
    }

    // The file stream is opened lazily when Jersey invokes the StreamingOutput,
    // and closed by try-with-resources when the copy finishes or fails.
    StreamingOutput fileStream = output -> {
      try (FileInStream input = mFsClient.openFile(uri)) {
        byte[] buffer = new byte[1024];
        int length;
        while ((length = input.read(buffer)) != -1) {
          output.write(buffer, 0, length);
          output.flush();
        }
      } catch (AlluxioException e) {
        throw new RuntimeException(e);
      }
    };

    return Response.ok(fileStream)
        .header("Content-Disposition", "attachment; filename=" + uri.getName())
        .header("Content-Length", status.getLength())
        .build();
  }
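Not the root cause of the broken pipe, but worth noting: the handler above copies with a 1 KB buffer and flushes after every chunk, which makes the response very chatty on slow links. A self-contained sketch of a less chatty copy loop (plain java.io, no Alluxio types; class and method names here are illustrative):

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

class StreamCopy {
  /** Copies in to out with a 64 KB buffer, flushing once at the end. */
  static long copy(InputStream in, OutputStream out) throws IOException {
    byte[] buffer = new byte[64 * 1024];
    long total = 0;
    int n;
    while ((n = in.read(buffer)) != -1) {
      out.write(buffer, 0, n);
      total += n;
    }
    out.flush();
    return total;
  }
}
```

Returning the total also makes it easy to log when the bytes sent differ from the Content-Length, which is the truncated-download symptom reported below.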

jasondrogba commented on June 15, 2024

Have you found errors in the master, worker and proxy logs? Can you share the logs?

if only one worker is used for downloading, no problems are found; However, if multiple worker nodes are called to download files at the same time, some of the files will fail to download, similar to the phenomenon of using SDK.

You can try increasing the property alluxio.user.block.master.client.pool.size.max.
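For example, in the client-side conf/alluxio-site.properties (512 is an illustrative value; tune it to your concurrency level):

```properties
alluxio.user.block.master.client.pool.size.max=512
```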

Never-D commented on June 15, 2024

@jasondrogba The error log is: java.io.IOException: Broken pipe / org.apache.catalina.connector.ClientAbortException: java.io.IOException: Broken pipe

Never-D commented on June 15, 2024

@jasondrogba Is there any solution to the concurrency issue caused by adding a proxy-like download interface to the worker node?

jasondrogba commented on June 15, 2024

@jasondrogba The error log is: java.io.IOException: Broken pipe / org.apache.catalina.connector.ClientAbortException: java.io.IOException: Broken pipe

@Never-D I guess this error message comes from Spring Boot? It may be that the timeout in the Tomcat configuration or nginx configuration is too small.
https://stackoverflow.com/questions/43825908/org-apache-catalina-connector-clientabortexception-java-io-ioexception-apr-err

Most likely, your server is taking too long to respond and the client is getting bored and closing the connection.
A bit more explanation: tomcat receives a request on a connection and tries to fulfill it. Imagine this takes 3 minutes, now, if the client has a timeout of say 2 minutes, it will close the connection and when tomcat finally comes back to try to write the response, the connection is closed and it throws an org.apache.catalina.connector.ClientAbortException.

I think you can increase the timeouts of the Spring Boot server and nginx, or increase the CPU and memory of the Alluxio nodes.
You can share the Alluxio logs so we can take a look at what causes the concurrent processing timeout; it's difficult to determine the cause from just the error you shared. Have you found any errors in master.log and worker.log under alluxio/logs? Please share any errors from the Alluxio logs.
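As a sketch of where such timeouts live (the property name below is an assumption to verify against your Spring Boot version):

```properties
# Spring Boot application.properties (embedded Tomcat); the exact key varies by Boot version
server.tomcat.connection-timeout=5m
```

On the nginx side, the analogous knobs are proxy_read_timeout and proxy_send_timeout in the relevant server or location block.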

Never-D commented on June 15, 2024

@jasondrogba 2024-03-19 11:18:58,227 INFO ALLUXIO-PROXY-WEB-SERVICE-224 - Alluxio S3 API received GET request: URI=http://alluxio-test-proxy.dev.sheincorp.cn/api/v1/paths/%2Fshein-os/cos-alluxio-test/data/upload-test/2/nexus-test/aws-sdk-cpp-v1.0.tar.gz/download-file User=null Media Type=null Query Parameters={} Path Parameters={}
2024-03-19 11:19:15,682 WARN ALLUXIO-PROXY-WEB-SERVICE-157 - Failed to read block 21508390913 of file /shein-os/cos-alluxio-test/data/upload-test/2/nexus-test/aws-sdk-cpp-v1.0.tar.gz from worker WorkerNetAddress{host=10.121.0.207, containerHost=, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=, tieredIdentity=TieredIdentity(node=10.121.0.207, rack=null)}. This worker will be skipped for future read operations, will retry: alluxio.exception.status.UnavailableException: io exception.

Never-D commented on June 15, 2024

@jasondrogba Downloading a 162 MB file, the downloaded size is incorrect, but there is no error message.
2024-03-19 12:50:27 (1.60 MB/s) - ‘aws-sdk-cpp-v1.0.tar.gz.184’ saved [58884016]

jasondrogba commented on June 15, 2024

This worker will be skipped for future read operations

According to this line, I traced the error to AlluxioFileInStream;
you can take a look at #16094 and #16096

export ALLUXIO_FUSE_JAVA_OPTS="-XX:MaxDirectMemorySize=128m"

You can try increasing MaxDirectMemorySize.
@secfree Hi~, do you have any ideas about this error? I think you have more experience and could help with this issue.

YichuanSun commented on June 15, 2024

One possible reason: you have to close the FileSystem instance at the end of your code; otherwise these FileSystem objects leak resources. @Never-D
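The point generalizes: any Closeable client opened per request must be closed deterministically, or under high concurrency the leaked instances accumulate. A self-contained sketch using a hypothetical stand-in class (Alluxio's real FileSystem is likewise Closeable, but is not available here):

```java
import java.io.Closeable;
import java.util.concurrent.atomic.AtomicInteger;

class CloseDemo {
  // Counts currently-open clients, standing in for leaked native/network resources.
  static final AtomicInteger OPEN = new AtomicInteger();

  /** Hypothetical stand-in for a Closeable client such as a FileSystem. */
  static class Client implements Closeable {
    Client() { OPEN.incrementAndGet(); }
    @Override public void close() { OPEN.decrementAndGet(); }
  }

  /** try-with-resources guarantees the client is closed even if the body throws. */
  static void handleRequest() {
    try (Client c = new Client()) {
      // ... use c to read the file ...
    }
  }
}
```

An even cheaper pattern is to create one long-lived FileSystem instance and share it across requests, closing it only at service shutdown.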

YichuanSun commented on June 15, 2024

One possible reason: you have to close the FileSystem instance at the end of your code; otherwise these FileSystem objects leak resources. @Never-D

Especially in such a high concurrency case.

Never-D commented on June 15, 2024

@YichuanSun The error I sent was a proxy error

Never-D commented on June 15, 2024

There are no error logs in the worker node

