Giter Site home page Giter Site logo

Comments (5)

chriswendt1 avatar chriswendt1 commented on August 19, 2024

Thanks for the tip. I'll have a look.

from documenttranslator-legacy.

chriswendt1 avatar chriswendt1 commented on August 19, 2024

@jsypkens, did you have a fix for this already? I'll try to look at it next weekend, but if you have a fix, that would save me.

from documenttranslator-legacy.

jsypkens avatar jsypkens commented on August 19, 2024

I made a few other changes to the method as well, and the implementation changed a little bit, so I'm not sure if my fix would be what you're looking for.

I separated out the "maxSize" into two parameters - maxListSize and maxElementSize, then looped through each element in values, checking if it would exceed the group or the element size. I also changed the return type to List<List<<Tuple<int,string>>, for two main reasons:

  1. It will track the source line number of each string element, so I can translate all the elements from multiple batches in parallel and then re-assemble the translated elements in the right order
  2. The v3 API calls fail when translating empty strings or whitespace (which is common when extracting text out of documents), so I'm removing those when translating, but re-inserting the white space in the correct place after translation returns

`public static List<List<Tuple<int, string>>> SplitList2(IEnumerable values, int groupSize, int maxListSize, int maxElementSize)
{
List<List<Tuple<int, string>>> result = new List<List<Tuple<int, string>>>();
int currentLineNumber = 0;
int currentListSize = 0;

        List<Tuple<int, string>> newList = new List<Tuple<int, string>>();

        foreach (var value in values)
        {
            if (newList.Count > groupSize)
            {
                result.Add(newList);
                newList = new List<Tuple<int, string>>();
                currentListSize = 0;
            }

            if (value.Length > maxElementSize) //current item is longer than item max, split it on sentence/paragraph
            {
                if (value.Length > maxListSize)
                {
                    //Split this item up! by itself, it's too long for the list. Use BreakSentence API to split on sentences.                        
                    BreakSentenceResponse[] BRR = TranslationServiceFacade.BreakSentences(new string[] { value });
                    List<string> sentences = new List<string>();
                    int currentPosition = 0;
                    foreach (var sentSplitLocation in BRR[0].sentLen) //BRR[0]: we're only passing in one value, we should only get one out
                    {
                        string currentSentence = value.Substring(currentPosition, sentSplitLocation);
                        if (sentences.Count > 0)
                        {
                            //Check to see if we will exceed the maxElementSize if we added this sentence
                            if (sentences.Last().Length + currentSentence.Length + 1 > maxElementSize)
                                sentences.Add(currentSentence); //would exceed the max, so just add it to the list separately
                            else
                                sentences[sentences.Count() - 1] += " " + currentSentence; //wouldn't exceed the max, append to the current list
                        }
                        else //first sentence, just add it
                            sentences.Add(currentSentence);
                        currentPosition += sentSplitLocation;
                    }

                    //loop through sentences, and add them to the list to be translated.
                    foreach (var sent in sentences) //make sure extractedSentence is under maxElementSize
                    {
                        if (currentListSize + sent.Length >= maxListSize || newList.Count >= groupSize)
                        {
                            result.Add(newList);
                            newList = new List<Tuple<int, string>>();
                            currentListSize = 0;
                        }
                        newList.Add(new Tuple<int, string>(currentLineNumber, sent));
                    }

                }
                else  //the element isn't too big for the list by itself...
                {
                    if (currentListSize + value.Length >= maxListSize) //next item will put the list over the max, start a new list
                    {
                        result.Add(newList);
                        newList = new List<Tuple<int, string>>();
                        currentListSize = 0;
                    }

                    BreakSentenceResponse[] BRR = TranslationServiceFacade.BreakSentences(new string[] { value });
                    List<string> sentences = new List<string>();
                    int currentPosition = 0;
                    foreach (var sentSplitLocation in BRR[0].sentLen) //we're only passing in one value, we should only get one out
                    {
                        string currentSentence = value.Substring(currentPosition, sentSplitLocation);
                        if (sentences.Count > 0)
                        {
                            if (sentences.Last().Length + currentSentence.Length + 1 > maxElementSize)
                                sentences.Add(currentSentence);
                            else
                                sentences[sentences.Count() - 1] += " " + currentSentence;
                        }
                        else
                            sentences.Add(currentSentence);
                        currentPosition += sentSplitLocation;
                    }


                    foreach (var sent in sentences) //make sure extractedSentence is under maxElementSize
                    {
                        if (currentListSize + sent.Length >= maxListSize || newList.Count >= groupSize)
                        {
                            result.Add(newList);
                            newList = new List<Tuple<int, string>>();
                            currentListSize = 0;
                        }
                        newList.Add(new Tuple<int, string>(currentLineNumber, sent));
                        currentListSize += sent.Length;
                    }
                }

            }
            else // element does not exceed the max element size
            {
                if (currentListSize + value.Length >= maxListSize) //make sure it won't exceed the current list size
                {
                    result.Add(newList);
                    newList = new List<Tuple<int, string>>();
                    currentListSize = 0;
                }

                newList.Add(new Tuple<int, string>(currentLineNumber, value));
                currentListSize += value.Length;
            }
            currentLineNumber++;
        }
        if (newList.Count > 0)
            result.Add(newList);

        foreach (var list in result)
        {


            int totalLength = 0;
            foreach (var item in list)
            {
                totalLength += item.Item2.Length;
            }
            if (totalLength > maxListSize)
            {
                logger.Error("list exceeds max. why?");
            }
        }

        return result;
    }

`

from documenttranslator-legacy.

chriswendt1 avatar chriswendt1 commented on August 19, 2024

@jsypkens, thanks a lot for posting your code. I'll have a look over the weekend. Good use of BreakSentences. That'll be required.

from documenttranslator-legacy.

chriswendt1 avatar chriswendt1 commented on August 19, 2024

Thanks, and finally fixed in the 2.3.0 release.
I ended up not using your code. I left the SplitList functin mostly intact, added an exit criterion, and allowed segments that are larger than 5000 characters to be translated relardless, by inserting a splitter function that uses the server's BreakSentence method.
I would love if you can give that a try and see whether it works for you.

from documenttranslator-legacy.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.