Comments (9)
Here's what I had previously. I updated the build_chunks function and added a new function for chunking tables. I haven't tested this extensively, but it seemed to work well. One issue I can foresee: if the table header rows alone exceed the token limit, this approach won't work.
import logging

import minify_html
from bs4 import BeautifulSoup

def build_chunks(self, document_map, myblob_name, myblob_uri, chunk_target_size):
    """Build chunk outputs based on the document map."""
    chunk_text = ''
    chunk_size = 0
    file_number = 0
    page_number = 0
    previous_section_name = document_map['structure'][0]['section']
    previous_title_name = document_map['structure'][0]["title"]
    previous_subtitle_name = document_map['structure'][0]["subtitle"]
    page_list = []
    chunk_count = 0

    def finalize_chunk():
        nonlocal chunk_text, chunk_count, chunk_size, file_number, page_list, page_number
        if chunk_text:  # Only write out if there is text to write
            self.write_chunk(myblob_name, myblob_uri, file_number,
                             chunk_size, chunk_text, page_list,
                             previous_section_name, previous_title_name,
                             previous_subtitle_name)
            chunk_count += 1
            file_number += 1  # Increment the file/chunk number
        # Reset the chunk variables
        chunk_text = ''
        chunk_size = 0
        page_list = []
        page_number = 0  # Reset the page_number for the new chunk

    for paragraph_element in document_map['structure']:
        paragraph_size = self.token_count(paragraph_element["text"])
        paragraph_text = paragraph_element["text"]
        section_name = paragraph_element["section"]
        title_name = paragraph_element["title"]
        subtitle_name = paragraph_element["subtitle"]

        # Handle table paragraphs separately
        if paragraph_element["type"] == "table":
            # Check whether the table needs to be split into multiple chunks
            if paragraph_size > chunk_target_size:
                # Split the table into chunks, repeating the header rows in each
                table_chunks = self.chunk_table_with_headers(paragraph_text, chunk_target_size)
                for table_chunk in table_chunks:
                    finalize_chunk()  # Finalize the previous chunk before starting a new one
                    chunk_text = minify_html.minify(table_chunk)  # Set the current chunk to the table chunk
                    chunk_size = self.token_count(chunk_text)  # Update the chunk size
                    finalize_chunk()  # Finalize the current table chunk
                continue  # Skip to the next paragraph element

        # Check if a new chunk should be started
        if (chunk_size + paragraph_size >= chunk_target_size) or \
           (section_name != previous_section_name) or \
           (title_name != previous_title_name) or \
           (subtitle_name != previous_subtitle_name):
            finalize_chunk()

        # Add paragraph to the chunk
        chunk_text += "\n" + paragraph_text
        chunk_size += paragraph_size
        if page_number != paragraph_element["page_number"]:
            page_list.append(paragraph_element["page_number"])
            page_number = paragraph_element["page_number"]

        # Update previous section, title, and subtitle
        previous_section_name = section_name
        previous_title_name = title_name
        previous_subtitle_name = subtitle_name

    # Finalize the last chunk after the loop
    finalize_chunk()
    logging.info("Chunking is complete")
    return chunk_count
def chunk_table_with_headers(self, table_html, chunk_target_size):
    """Split an HTML table into chunks, repeating the header rows in each chunk."""
    soup = BeautifulSoup(table_html, 'html.parser')
    # Check for and extract the thead and tbody, or default to the entire table
    thead = soup.find('thead')
    tbody = soup.find('tbody') or soup.find('table')
    rows = soup.find_all('tr') if not tbody else tbody.find_all('tr')
    header_html = f"<table>{minify_html.minify(str(thead))}" if thead else "<table>"

    # Initialize the chunks list and current_chunk with the header
    current_chunk = header_html
    chunks = []

    def add_current_chunk():
        nonlocal current_chunk
        # Close the table tag for the current chunk and add it to the chunks list
        if current_chunk.strip() and not current_chunk.endswith("<table>"):
            current_chunk += '</table>'
            chunks.append(current_chunk)
        # Start a new chunk with the header, if it exists
        current_chunk = header_html

    for row in rows:
        row_html = minify_html.minify(str(row))
        # If adding this row would exceed the target size, close out the current chunk
        if self.token_count(current_chunk + row_html) > chunk_target_size:
            add_current_chunk()
        # Add the current row to the chunk
        current_chunk += row_html

    # Add the final chunk if there's any content left
    add_current_chunk()
    return chunks
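For anyone who wants to try the header-repeating idea in isolation, here is a minimal standalone sketch of the same technique. It makes two simplifying assumptions: a regex-based row splitter stands in for BeautifulSoup/minify_html, and a rough 4-characters-per-token heuristic stands in for the app's token_count, so it is a demonstration of the chunking logic rather than a drop-in replacement:

```python
import re

def token_count(text: str) -> int:
    # Crude stand-in for the app's tokenizer: roughly 4 characters per token
    return max(1, len(text) // 4)

def chunk_table_with_headers(table_html: str, chunk_target_size: int) -> list:
    """Split an HTML table into chunks, repeating the <thead> in each one."""
    thead_match = re.search(r"<thead>.*?</thead>", table_html, re.S)
    header_html = "<table>" + (thead_match.group(0) if thead_match else "")
    # Strip the thead so the row scan only sees body rows
    body_html = re.sub(r"<thead>.*?</thead>", "", table_html, flags=re.S)
    rows = re.findall(r"<tr>.*?</tr>", body_html, re.S)
    chunks = []
    current = header_html
    for row in rows:
        # Close out the current chunk before this row would exceed the budget
        if token_count(current + row) > chunk_target_size and current != header_html:
            chunks.append(current + "</table>")
            current = header_html
        current += row
    if current != header_html:
        chunks.append(current + "</table>")
    return chunks

# Tiny demo: a 6-row table split into chunks, each repeating the header
table = ("<table><thead><tr><th>a</th><th>b</th></tr></thead>"
         + "".join(f"<tr><td>r{i}</td><td>x</td></tr>" for i in range(6))
         + "</table>")
for chunk in chunk_table_with_headers(table, chunk_target_size=27):
    print(chunk)
```

With a budget of 27 pseudo-tokens this yields three chunks of two rows each, every one opening with the repeated `<thead>` and closing its own `</table>`.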
from pubsec-info-assistant.
Hi @TaylorN15, we ran the file you linked to. The issue relates to table processing: if a file exceeds our target token count for a chunk, that limit is not respected. We have added a task to our board to split tables by chunk size and repeat the table header rows in each chunk.
When we switched to using unstructured.io for non-PDF documents, we were aware of the same issue there; they were planning to add this feature. So we need to make the change in our code, follow up with unstructured to confirm whether it has been fixed, and update that path as well.
This issue has been updated to an enhancement
Thanks @georearl. I actually wrote a function that will chunk a table whilst keeping header rows intact. I was using it in the previous version of the app to chunk Excel files before you introduced the Unstructured library. I can share the code if you’d like? It may assist?
That would be great. Please share the code, or feel free to create a PR.
@TaylorN15 checking in to see if you've made progress on this and if you will be submitting a PR?
I had only made a start with the above code. It works in some cases but not others, and it won't work with anything that gets run through the Unstructured library. I feel like it's a key decision outside of my purview :)
Thank you for the feedback. We'll keep this open for review.
Resolved and included in the code base. Thank you @TaylorN15.
@georearl - what if the table headers are larger than the chunk size as I mentioned earlier?