aws-samples / amazon-textract-enhancer

This workshop demonstrates how to build a Document parser and query engine with Amazon Textract and other services, such as ElasticSearch and DynamoDB.

License: MIT No Attribution

Python 90.13% HTML 2.67% JavaScript 7.20%

amazon-textract-enhancer's Issues

The stack fails when we change the region.

I get the error below: Error occurred while GetObject. S3 Error Code: PermanentRedirect. S3 Error Message: The bucket is in this region: us-east-1. Please use this region to retry the request (Service: AWSLambdaInternal; Status Code: 400; Error Code: InvalidParameterValueException; Request ID: 46f2bd08-7640-4156-bde3-32727516a887)

I need to create this stack in the us-east-2 (Ohio) region.

Kindly help; any help is truly appreciated.

KeyError: 'Blocks' bash: KeyError:: command not found

I am trying to extract tables from a pdf file and export them to a csv file. I run into this error:

blocks=response['Blocks']
[root@ip-172-31-72-49 centos]# KeyError: 'Blocks'

`import webbrowser, os
import json
import boto3
import io
from io import BytesIO
import sys
from pprint import pprint

def get_rows_columns_map(table_result, blocks_map):
    rows = {}
    for relationship in table_result['Relationships']:
        if relationship['Type'] == 'CHILD':
            for child_id in relationship['Ids']:
                cell = blocks_map[child_id]
                if cell['BlockType'] == 'CELL':
                    row_index = cell['RowIndex']
                    col_index = cell['ColumnIndex']
                    if row_index not in rows:
                        # create new row
                        rows[row_index] = {}

                    # get the text value
                    rows[row_index][col_index] = get_text(cell, blocks_map)
    return rows

def get_text(result, blocks_map):
    text = ''
    if 'Relationships' in result:
        for relationship in result['Relationships']:
            if relationship['Type'] == 'CHILD':
                for child_id in relationship['Ids']:
                    word = blocks_map[child_id]
                    if word['BlockType'] == 'WORD':
                        text += word['Text'] + ' '
                    if word['BlockType'] == 'SELECTION_ELEMENT':
                        if word['SelectionStatus'] == 'SELECTED':
                            text += 'X '
    return text

def get_table_csv_results(file_name):

    with open(file_name, 'rb') as file:
        img_test = file.read()
        bytes_test = bytearray(img_test)
        print('Image loaded', file_name)

    # process using image bytes
    # get the results
    client = boto3.client('textract')

    response = client.start_document_text_detection(
        DocumentLocation={
            'S3Object': {
                'Bucket': s3BucketName,
                'Name': documentName
            }
        })

    blocks = response['Blocks']
    pprint(blocks)

    blocks_map = {}
    table_blocks = []
    for block in blocks:
        blocks_map[block['Id']] = block
        if block['BlockType'] == "TABLE":
            table_blocks.append(block)

    if len(table_blocks) <= 0:
        return "<b> NO Table FOUND </b>"

    csv = ''
    for index, table in enumerate(table_blocks):
        csv += generate_table_csv(table, blocks_map, index + 1)
        csv += '\n\n'

    return csv

def generate_table_csv(table_result, blocks_map, table_index):
    rows = get_rows_columns_map(table_result, blocks_map)

    table_id = 'Table_' + str(table_index)

    # get cells.
    csv = 'Table: {0}\n\n'.format(table_id)

    for row_index, cols in rows.items():

        for col_index, text in cols.items():
            csv += '{}'.format(text) + ","
        csv += '\n'

    csv += '\n\n\n'
    return csv

def main(file_name):
    table_csv = get_table_csv_results(file_name)

    output_file = 'output.csv'

    with open(output_file, "wt") as fout:
        fout.write(table_csv)

    print('CSV OUTPUT FILE: ', output_file)

s3BucketName = "chrisyou.sagemi.com"
documentName = "DETAIL.pdf"

if __name__ == "__main__":
    file_name = sys.argv[1]
    main(file_name)`
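
A possible explanation, offered as a sketch rather than a confirmed fix: start_document_text_detection is an asynchronous call, so its response carries a JobId and no 'Blocks' key. The blocks only become available from the matching "get" call once the job has finished. Text detection also does not return TABLE blocks, so table extraction would need start_document_analysis with FeatureTypes=['TABLES']. The helper name fetch_analysis_blocks, the polling interval, and the poll limit below are assumptions made for illustration.

```python
import time


def fetch_analysis_blocks(client, bucket, name, poll_seconds=5, max_polls=120):
    """Start an asynchronous table analysis job and return all of its blocks.

    `client` is expected to behave like boto3.client('textract').
    The function name and the polling parameters are illustrative assumptions.
    """
    job = client.start_document_analysis(
        DocumentLocation={'S3Object': {'Bucket': bucket, 'Name': name}},
        FeatureTypes=['TABLES'],
    )
    job_id = job['JobId']  # the start call returns a job id, not blocks

    # Poll until the job leaves the IN_PROGRESS state.
    for _ in range(max_polls):
        result = client.get_document_analysis(JobId=job_id)
        status = result['JobStatus']
        if status == 'SUCCEEDED':
            break
        if status == 'FAILED':
            raise RuntimeError(result.get('StatusMessage', 'Textract job failed'))
        time.sleep(poll_seconds)
    else:
        raise TimeoutError('Textract job did not finish in time')

    # Results are paginated; follow NextToken until it is absent.
    blocks = list(result.get('Blocks', []))
    while result.get('NextToken'):
        result = client.get_document_analysis(
            JobId=job_id, NextToken=result['NextToken']
        )
        blocks.extend(result.get('Blocks', []))
    return blocks
```

With this in place, the line blocks=response['Blocks'] in the script could become blocks = fetch_analysis_blocks(boto3.client('textract'), s3BucketName, documentName), leaving the rest of the table handling unchanged.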

README.md error about synchronous calls

README.md mentions:

"Making a synchronous call to query Textract API is not possible for multi-page PDF documents"

This phrase can be improved: synchronous calls are not possible for any PDF file, regardless of whether it is a single-page or multi-page document.

For reference: https://docs.aws.amazon.com/textract/latest/dg/API_DetectDocumentText.html
Documents for synchronous operations can be in PNG or JPEG format. Documents for asynchronous operations can also be in PDF format.

Non-provisionable CFN stack

Problem #1

  • Change the bucket name used for the Lambda function code, since S3 bucket names are global
  • Zip the Lambda functions first (textract-lambda-code.zip) and upload the archive to the bucket above
  • Set the CloudWatch Logs role ARN for TextractDemoAPIDeployment

Feature request - Add API authentication or make API private

Hi,

Currently, the /retrievedocumentanalysisresult and /retrievetextdetectionresult endpoints are public and have no authentication. I recommend adding API authentication (e.g. IAM, Cognito, etc.) or making the API private, to avoid the risk that someone tests this project with sensitive production documents and inadvertently leaves them publicly exposed. Yes, the chances are slim, since a requester needs to know the bucket and key name for the API to return results, but I still wanted to suggest this change.

S3 Event sends URL encoded key names, causing Lambda handler to fail on Textract API calls

Hi,

Per S3 docs:

The s3 key provides information about the bucket and object involved in the event. The object key name value is URL encoded. For example, "red flower.jpg" becomes "red+flower.jpg" (Amazon S3 returns "application/x-www-form-urlencoded" as the content type in the response).

The current extract-Enhancer-TextractAsyncJobSubmitFunction Lambda does not URL decode the S3 key received in the event JSON, so I'm receiving errors when trying to parse objects that contain spaces or other URL-encoding relevant characters.

For example, if I upload a document named my test.pdf, the S3 event sent to the extract-Enhancer-TextractAsyncJobSubmitFunction function contains the key Records[0].s3.object.key = my+test.pdf.

The Textract API calls textract.start_document_analysis() and textract.start_document_text_detection() then fail because the DocumentLocation parameter has a value of my+test.pdf when it should instead be my test.pdf.

Can you add URL decoding to the S3 key name in the received events?
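
As a minimal sketch of the requested change, the key can be decoded with urllib.parse.unquote_plus from the standard library before it is passed to Textract. The helper name decoded_objects_from_event is hypothetical and not part of the repository; the event shape follows the Records[0].s3.object.key path quoted above.

```python
from urllib.parse import unquote_plus


def decoded_objects_from_event(event):
    """Return (bucket, key) pairs with the URL-encoded key names decoded.

    Hypothetical helper: the Lambda in the repository may organise this
    differently.
    """
    objects = []
    for record in event.get('Records', []):
        bucket = record['s3']['bucket']['name']
        # S3 sends "my test.pdf" as "my+test.pdf"; unquote_plus reverses both
        # the "+" for space and any %XX escapes.
        key = unquote_plus(record['s3']['object']['key'])
        objects.append((bucket, key))
    return objects
```

The decoded key is what would then go into the DocumentLocation parameter of start_document_analysis() and start_document_text_detection().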

Process fails when input file contains spaces

Lambda TextractAsyncJobSubmitFunction runs to success, but actually fails to process the file. The following errors can be seen in CloudWatch:

  • An error occurred (InvalidParameterException) when calling the StartDocumentAnalysis operation: Request has invalid parameters
  • An error occurred (InvalidParameterException) when calling the StartDocumentTextDetection operation: Request has invalid parameters

Two issues: 1. the problem specified above, and 2. the Lambda itself should probably fail under these circumstances (can't perform operation after retries).

Thanks for this code.
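
For the second point, one way to make the function fail visibly is to re-raise once the retries are exhausted, so the invocation is recorded as an error instead of a success. This is only a sketch: submit_with_retries and its parameters are hypothetical names, and the actual retry logic in the Lambda may look different.

```python
import time


def submit_with_retries(submit, attempts=3, delay_seconds=1.0):
    """Call submit() up to `attempts` times and raise if every attempt fails.

    `submit` is any zero-argument callable, for example a lambda wrapping
    textract.start_document_analysis(...). Names here are illustrative.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return submit()
        except Exception as error:
            last_error = error
            print("Attempt {} of {} failed: {}".format(attempt, attempts, error))
            if attempt < attempts:
                time.sleep(delay_seconds)
    # Raising here makes the Lambda invocation itself fail.
    raise RuntimeError(
        "Giving up after {} attempts".format(attempts)
    ) from last_error
```

Because the exception propagates out of the handler, the failure would show up in the function's error metrics and not only as a log line in CloudWatch.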

Issue with Policy Names

Hello, I am currently trying to implement the enhancer in my AWS environment. Unfortunately, I keep running into a problem with the policy names. The Lambda functions assume the name "LambdaTextractRole", but the policy is created with the name "us-east-1-LambdaTextractRole". Also, the CloudFormation stack does not finish executing (as you can see in the video).

Does anyone have any idea what the mistake is?

BF3C05AB-09C7-462C-AC03-0CF2BD52745C

KeyError: 'Relationships' thrown when parsing Textract page results in textract_util.py

Hi,

I'm running the demo project with a PDF and receiving the error below. I'm working on investigating, but wanted to open the issue for tracking.

START RequestId: 0b976833-8688-42b4-9865-fe3e2955f984 Version: $LATEST
1 messages recieved
JobId = d1069774cc8edcbf66b12ff7497476fab0f653e80368d24de569e5e135a4e56b
Status = SUCCEEDED
Timestamp = 1567367135
API = StartDocumentTextDetection
JobTag = TextractTextDetectionJob-e925e04aeb58efc79afb74f5c6953cc2
S3ObjectName = My Test.pdf
S3Bucket = 544941453660-scanned-documents
upload_prefix = d1069774cc8edcbf66b12ff7497476fab0f653e80368d24de569e5e135a4e56b
Retrieved 1000 Blocks from Textract Text Detection response
Retrieved 1000 Blocks from Textract Text Detection response
Retrieved 1000 Blocks from Textract Text Detection response
Retrieved 1000 Blocks from Textract Text Detection response
Retrieved 1000 Blocks from Textract Text Detection response
Retrieved 890 Blocks from Textract Text Detection response
5890 Blocks retrieved
Extracted Block Types:
PAGE = 10
LINE = 2353
WORD = 3527
Page-1 contains 184 Lines
Page-2 contains 388 Lines
Page-3 contains 487 Lines
Page-4 contains 251 Lines
Page-5 contains 262 Lines
Page-6 contains 352 Lines
Page-7 contains 371 Lines
Page-8 contains 23 Lines
Page-9 contains 35 Lines
'Relationships': KeyError
Traceback (most recent call last):
File "/var/task/detect-text-postprocess-page.py", line 74, in lambda_handler
document_text, num_lines = extractTextBody(blocks)
File "/var/task/textract_util.py", line 423, in extractTextBody
print("Page-{} contains {} Lines".format(page['Page'], len(page['Relationships'][0]['Ids'])))
KeyError: 'Relationships'

END RequestId: 0b976833-8688-42b4-9865-fe3e2955f984
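
The log shows 10 PAGE blocks but line counts for only nine pages, so one possible cause is that the tenth page has no detected text: a Textract block without children does not include a 'Relationships' key at all. This is an assumption, not something confirmed against the document. A defensive version of the line count could look like the sketch below, where count_page_lines is a hypothetical helper and not a function in textract_util.py.

```python
def count_page_lines(page_block):
    """Count the child ids of a PAGE block, tolerating pages with no text.

    Hypothetical helper: returns 0 when the block has no 'Relationships'
    key instead of raising KeyError.
    """
    total = 0
    for relationship in page_block.get('Relationships', []):
        if relationship.get('Type') == 'CHILD':
            total += len(relationship.get('Ids', []))
    return total


# Example with a page that has two lines and a page that has none.
pages = [
    {'BlockType': 'PAGE', 'Page': 9,
     'Relationships': [{'Type': 'CHILD', 'Ids': ['line-1', 'line-2']}]},
    {'BlockType': 'PAGE', 'Page': 10},
]
for page in pages:
    print("Page-{} contains {} Lines".format(page['Page'], count_page_lines(page)))
```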
