Comments (4)
I modified this locally to use an NLP library to detect sentences; I think it's a bit better at dealing with different media types. No PR because my local version is a kludgy mess at the moment, but here's my replacement CreateChunks
function (note it has an extra parameter to define how many sentences to pull, so make sure to update the caller too):
import (
	"fmt"
	"log"

	"github.com/jdkato/prose/v2"
)

func CreateChunks(fileContent string, window int, stride int, title string, chunkSize int) []Chunk {
	doc, err := prose.NewDocument(fileContent)
	if err != nil {
		log.Fatal(err)
	}
	// NLP-based sentence detection instead of strings.Split(fileContent, ".")
	sentences := doc.Sentences()
	newData := make([]Chunk, 0)
	c := 0
	text := ""
	start := 0
	end := 0
	for si := range sentences {
		text += " " + sentences[si].Text
		end = start + len(text)
		// flush when the chunk is full, or when the final sentence is reached
		if c == chunkSize || si == len(sentences)-1 {
			if checkTokenLimit(text) {
				// only write chunks that are within the token limit
				newData = append(newData, Chunk{
					Start: start,
					End:   end,
					Title: title,
					Text:  text,
				})
			} else {
				fmt.Println("chunk size too large!")
			}
			text = ""
			c = 0
			// the next chunk starts where this one ended
			start = end + 1
		}
		c++
	}
	return newData
}
And a test:
func TestCreateChunks(t *testing.T) {
	// 13 sentences
	doc := `Lorem ipsum dolor sit amet, consectetur adipiscing elit. Curabitur pretium scelerisque lorem eget eleifend. Suspendisse condimentum libero at nisl commodo, ac pretium sapien convallis. Sed id lectus non justo varius semper sit amet in sapien. Proin arcu arcu, consequat fermentum tortor lacinia, tincidunt consectetur turpis. Donec iaculis tincidunt iaculis. Cras pulvinar mauris tempor lectus lacinia efficitur. Sed in nibh tellus. Curabitur molestie aliquet leo, non efficitur felis. Integer condimentum libero nec sapien ultrices accumsan. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nullam quis sagittis dui. Phasellus venenatis nulla quis ligula rutrum bibendum.`
	chunks := CreateChunks(doc, 1, 1, "foo", 1)
	if len(chunks) != 12 {
		tx := ""
		for _, s := range chunks {
			tx += s.Text
		}
		fmt.Println(tx)
		t.Fatalf("expected 12 chunks, got %v\n", len(chunks))
	}
}
(edit: forgot the import)
from vault-ai.
@Aemon-Algiz yeah, the chunking for different types of documents can be improved significantly... Some thoughts:
- Books and video/audio transcripts – 20-sentence chunks are largely fine. The only consideration, as you pointed out, is that sentences can vary wildly in size, so an estimated tiktoken count would be nice to factor in for some cases.
- Legal documents and code/manufacturing documentation – Structure (sections, subsections) is particularly important for these, so all things being equal, it would be better to ingest an entire section instead of 20 sentences.
- Code – obviously, code is not oriented around sentences. So a completely different chunking algorithm would be needed for documents containing code.
If anyone has any other thoughts, I'd be happy to hear them.
@Aemon-Algiz @lonelycode good idea using an NLP library. Does this library support most languages? I ended up going with github.com/neurosnap/sentences – check out my fix here 2bff175
import (
	"fmt"
	"log"
	"strings"

	s "github.com/neurosnap/sentences"
	"github.com/neurosnap/sentences/english"
	tke "github.com/pkoukk/tiktoken-go"
)

// MaxTokensPerChunk is the maximum number of tokens allowed in a single chunk for OpenAI embeddings
const MaxTokensPerChunk = 500

const EmbeddingModel = "text-embedding-ada-002"

func CreateChunks(fileContent string, title string) ([]Chunk, error) {
	tokenizer, err := english.NewSentenceTokenizer(nil)
	if err != nil {
		return []Chunk{}, fmt.Errorf("NewSentenceTokenizer: %v", err)
	}
	sentences := tokenizer.Tokenize(fileContent)

	log.Println("[CreateChunks] getting tiktoken for", EmbeddingModel, "...")
	// Get tiktoken encoding for the model
	tiktoken, err := tke.EncodingForModel(EmbeddingModel)
	if err != nil {
		return []Chunk{}, fmt.Errorf("getEncoding: %v", err)
	}

	newData := make([]Chunk, 0)
	position := 0
	i := 0
	for i < len(sentences) {
		chunkTokens := 0
		chunkSentences := []*s.Sentence{}

		// Add sentences to the chunk until the token limit is reached
		for i < len(sentences) {
			tiktokens := tiktoken.Encode(sentences[i].Text, nil, nil)
			tokenCount := len(tiktokens)
			fmt.Printf(
				"[CreateChunks] #%d Token count: %d | Total number of sentences: %d | Sentence: %s\n",
				i, tokenCount, len(sentences), sentences[i].Text)
			if chunkTokens+tokenCount <= MaxTokensPerChunk {
				chunkSentences = append(chunkSentences, sentences[i])
				chunkTokens += tokenCount
				i++
			} else {
				log.Println("[CreateChunks] Adding this sentence would exceed the max token limit. Breaking...")
				break
			}
		}

		if len(chunkSentences) > 0 {
			text := strings.Join(sentencesToStrings(chunkSentences), "")
			start := position
			end := position + len(text)
			fmt.Printf("[CreateChunks] Created chunk and adding it to the array...\nText: %s\n", text)
			newData = append(newData, Chunk{
				Start: start,
				End:   end,
				Title: title,
				Text:  text,
			})
			fmt.Printf("[CreateChunks] New chunk array length: %d\n", len(newData))
			position = end

			// Set the stride for overlapping chunks
			stride := len(chunkSentences) / 2
			if stride < 1 {
				stride = 1
			}
			oldI := i
			i -= stride

			// Check if the next sentence would still fit within the token limit
			nextTokens := tiktoken.Encode(sentences[i].Text, nil, nil)
			nextTokenCount := len(nextTokens)
			if chunkTokens+nextTokenCount <= MaxTokensPerChunk {
				// Increment i without applying the stride
				i = oldI + 1
			} else if i == oldI {
				// Ensure i is always incremented to avoid an infinite loop
				i++
			}
		}
	}
	return newData, nil
}
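For what it's worth, the stride logic above boils down to a half-overlapping sliding window. Stripped of tokenization (plain item counts stand in for token counts here, and `windows` is a hypothetical helper for illustration) it looks like:

```go
package main

import "fmt"

// windows returns overlapping chunks of up to `size` items, advancing by
// `stride` each step (stride must be >= 1 or this loops forever).
func windows(items []string, size, stride int) [][]string {
	var out [][]string
	for i := 0; i < len(items); i += stride {
		end := i + size
		if end > len(items) {
			end = len(items)
		}
		out = append(out, items[i:end])
		// stop once the window reaches the end of the input
		if end == len(items) {
			break
		}
	}
	return out
}

func main() {
	items := []string{"s1", "s2", "s3", "s4", "s5"}
	for _, w := range windows(items, 4, 2) {
		fmt.Println(w)
	}
}
```

The overlap means a sentence near a chunk boundary appears in two chunks, so an embedding query can match it with context from either side.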