Comments (22)
Dear Andre… not yet. The code is undergoing internal testing and will probably make it into the May release.
Is skipFirstNRows available at this time?
from usql.
Hi Hugh. We are working on adding this capability. We will provide an argument to the built-in extractors Csv(), Tsv() and Text() called skipFirstNRows. Example usage is
EXTRACT ... FROM ... USING Extractors.Csv(skipFirstNRows:2);
Note that it can only skip rows in the first extent of the file. Thus, if your rows are very large or you need to skip many rows, you may get a runtime error if the first extent does not contain all the rows to be skipped.
Please let me know if this will address your requirement.
from usql.
Hi Mike,
The skipFirstNRows argument looks like it would address my requirement. Can you clarify whether it skips the first N rows of each file, or the first N rows of the first file/total input?
I'm not familiar with the concept of the first extent of a file in this context. Typically header rows contain a relatively small amount of data, I wouldn't require the ability to skip arbitrarily large sections of the input files.
from usql.
It would skip the first N rows of each file in a file set.
The caveat about extents is that, in order to parallelize extraction, files get split into extents. At the moment, each parallel execution can see its own extent and the next one. So if you want to skip more rows than the execution context that operates on the first extent can see, it will error.
from usql.
Skipping the first N rows is perfect for my use case.
Thanks for explaining the extent. Without knowing exactly what the size of an extent will be, I can't say for sure, but I don't expect this to be an issue. The largest header row in my use case would be around 5kB.
from usql.
In addition to this, I often see files that have some 'preamble', i.e. they contain useful data in a header:
Number of rows, date time of extraction etc..
It would be great to be able to process these directly without having to jump through too many hoops (i.e. without processing the file twice).
from usql.
@hughwasos : the extent size is somewhat determined by how you uploaded the file. If you created it with append calls, it can be smaller (there is a minimum extent size, but I am not sure what it is at the moment). Normally an extent size is 250MB.
@davebally : Hi Dave, the built-in extractors only give you the option to parse all lines as a single column and then either split the rows using U-SQL statements or skip over them. If you want to extract such heading information as part of each row, you will have to write your own custom extractor.
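A rough sketch of that single-column option (the file name, preamble marker and comma split are made up for illustration; not compiled or executed):

```sql
// Read each line as a single string column by using a column
// delimiter that is assumed not to occur in the data.
@lines =
    EXTRACT line string
    FROM "/mydata/filewithpreamble.txt"
    USING Extractors.Text(delimiter: '\u0001');

// Drop preamble lines by a known marker, then split the
// remaining lines into columns with C# string expressions.
@rows =
    SELECT line.Split(',')[0] AS col1,
           line.Split(',')[1] AS col2
    FROM @lines
    WHERE !line.StartsWith("#");
```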
Note that adding the header information into every row means that you either have to process the file atomically (one file per node) or fix up the rows produced by the nodes that operate on later parts of the file afterwards in U-SQL.
We have a feature on our backlog that will give you the option of returning more than one rowset that could be useful in this scenario, but currently that is pretty low in priority.
If you want to request a functionality to address the scenario beyond what you currently can do, please head to http://aka.ms/adlfeedback and file a request.
Thanks!
from usql.
I just tried using Extractors.Csv(skipFirstNRows:2) and received the following error:
ERRORID: E_CSC_USER_UNSUPPORTEDEXTRACTORSKIPFIRSTROWS
SEVERITY: Error
COMPONENT: CSC
SOURCE: USER
MESSAGE: Unsuppoted extractor parameter: skipFirstNRows.
DETAILS: N/A
DESCRIPTION: skipFirstNRows is not supported in current version.
RESOLUTION: Please remove skipFirstNRows parameter from the extractor parameters.
HELPLINK: N/A
FILEPATH: N/A
LINENUMBER: -1
STARTOFFSET: -1
ENDOFFSET: -1
Is this feature temporarily disabled?
from usql.
@RyanMathewson : The feature is currently disabled since we need to put some lower-level infrastructure in place (which will also allow us to operate more efficiently on files that are not row-oriented). Current ETA for completing that work and re-enabling the skipFirstNRows parameter is March 2016.
Current work-arounds include:
- Use silent:true. That will skip lines that do not fit the schema (too many or too few columns, or values that cannot be cast to the desired type). However, it will skip all such rows.
- EXTRACT into string columns and then use known strings or TryParse to filter the rows that you want/don't want. E.g. (message code, not compiled or executed):
REFERENCE TryParseAssembly;
@input = EXTRACT col1 string, col2 string, col3 string
FROM "filewithheader.csv" USING Extractors.Csv();
@rows = SELECT long.Parse(col1) AS col1, col2, DateTime.Parse(col3) AS col3
FROM @input WHERE MyNS.MyClass.TryParseAsInt(col1);
...Where the TryParseAssembly contains wrappers for TryParse calls such as TryParseAsInt.
...Or if the header prefix is known:
REFERENCE TryParseAssembly;
@input = EXTRACT col1 string, col2 string, col3 string
FROM "filewithheader.csv" USING Extractors.Csv();
@rows = SELECT long.Parse(col1) AS col1, col2, DateTime.Parse(col3) AS col3
FROM @input WHERE !col1.StartsWith("hdr");
- Write your own custom extractor where you handle the header lines. Note that at this point, you will need to do it with atomicFileProcessing set to true. For an example of how to specify that, look at our XML and JSON extractor samples.
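The TryParseAssembly referenced in the first example above is not spelled out in the thread; a minimal sketch of such a wrapper, using the namespace, class and method names from the example (C# code-behind, compiled and registered as an assembly), could look like:

```csharp
using System;

namespace MyNS
{
    public static class MyClass
    {
        // Wraps int.TryParse so it can be called as a boolean
        // predicate in a U-SQL WHERE clause.
        public static bool TryParseAsInt(string s)
        {
            int dummy;
            return int.TryParse(s, out dummy);
        }
    }
}
```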
from usql.
Can I select only 2 columns out of 10 for processing?
from usql.
@dhrubajyoti36 : You can write a custom extractor that does that. Or you can do the EXTRACT of the full schema and then do a SELECT of only the two columns you are interested in. Note that for textual formats like Csv you have to read the data anyway...
from usql.
Thank you Mike.
Could you please give me a sample custom script for that?
Another thing: I am not able to use an inner query or subquery to calculate a percentage (if it is possible without too many joins), can you give an example?
from usql.
@dhrubajyoti36 : Can you open a new issue and paste your non working inner query/subquery? That makes it easier to find as an issue and I could tell you how to fix the query :).
As to a sample custom script:
Here is one that shows you how to do the EXTRACT and SELECT. Note that the EXTRACT will still parse the data but the SELECT then will reduce the rowset for the next steps.
@input = EXTRACT col1 int, col2 string, col3 DateTime, col4 string
FROM "/mydirectory/myfile.csv"
USING Extractors.Csv();
@projinp = SELECT col1, col3 FROM @input;
// now operate on @projinp that has the columns pruned.
Writing a custom extractor is a bit more complex and needs some more time from my side. But basically, since you know how to parse the file, and the UDO model will provide you the requested schema from the EXTRACT clause, you can write the custom extractor so that only the requested columns are being provided.
We have some sample extractors here in the GitHub repo and I will add more over the next couple of weeks.
from usql.
Hi Mike ,
Thank you for the script. Actually, I asked for a sample of a customized extract script so that I can build some extract scripts which I use in several cases.
As per your script:
"EXTRACT col1 int, col2 string, col3 DateTime,"
col3 is declared as DateTime.
When I try to do that in my sample, it throws the error "Problem while converting input record".
I thought it was a problem with the file format, because when I define the column as string it will run, but if I define it as DateTime it will not work.
My sample file looks like:
123,25/01/2016
DECLARE @in string = "/Input/newfile.csv";
DECLARE @out string = "/Output/test_crm2.csv";
@querylog = EXTRACT
    CustomerID int,
    emailsreceivedOn DateTime
FROM @in
USING Extractors.Csv();
@result = SELECT CustomerID, emailsreceivedOn FROM @querylog;
OUTPUT @result TO @out USING Outputters.Csv();
The error :
"diagnosticCode": 195887125,
"severity": "Error",
"component": "RUNTIME",
"source": "User",
"errorId": "E_RUNTIME_USER_EXTRACT_COLUMN_CONVERSION_UNDEFINED_ERROR",
"message": "Failure when attempting to convert column data.",
"description": "2232352F30312F3230313622: Problem while converting input record. \nPosition: line 1, column 2.",
"resolution": "Check the input for errors or use "silent" switch to ignore over(under)-sized rows in the input.\nConsider that ignoring "invalid" rows may influence job results and that types have to be nullable for conversion errors to be ignored.",
"helpLink": "",
"details": "============================================================================================\nHEX:3132332C32352F30312F32303136\n ^\nTEXT:123,25/01/2016\n ^\n\n============================================================================================\n"
from usql.
You mention creating a custom extractor above. Could you provide more info on creating a custom extractor for a PDF file? The PDF would be a structured PDF with no pics, just straight data.
from usql.
@dhrubajyoti36 : based on the details showing the data it tries to convert, it seems that the value is 25/01/2016, which is not recognized by the DateTime conversion using the default locale. You either need to provide the date in MM/DD/YYYY or YYYY-MM-DD format, or extract it as a string and then process it with the locale that you want to support. For example:
DECLARE @in string = "/Temp/test.csv";
DECLARE @out string = "/Output/test_crm2.csv";
@querylog = EXTRACT
CustomerID int,
emailsreceivedOn string
FROM @in
USING Extractors.Csv();
@result = SELECT CustomerID
, DateTime.ParseExact(emailsreceivedOn,"dd/MM/yyyy",System.Globalization.CultureInfo.InvariantCulture) AS emailsreceivedOn
FROM @querylog;
OUTPUT @result TO @out USING Outputters.Csv();
from usql.
@aisley : I am unfortunately not a PDF format expert, so I cannot be of help. You would either write a custom extractor that parses the PDF format and extracts your data, or you write a custom extractor that calls something that knows how to extract the data from your PDF.
from usql.
I can handle the pdf reader/extractor. Just need instruction on getting started with a custom extractor. How to build/install. Any help would be great.
from usql.
Understood.
You basically implement the IExtractor contract in an assembly that you register.
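For reference, a bare-bones sketch of that contract (the types come from the Microsoft.Analytics.Interfaces namespace; the body just echoes text lines and is a placeholder for real PDF parsing):

```csharp
using System.Collections.Generic;
using System.IO;
using Microsoft.Analytics.Interfaces;

// AtomicFileProcessing = true sends the whole file to one vertex,
// which a non-splittable format like PDF typically requires.
[SqlUserDefinedExtractor(AtomicFileProcessing = true)]
public class MyPdfExtractor : IExtractor
{
    public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
    {
        using (var reader = new StreamReader(input.BaseStream))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // Placeholder: replace with actual PDF parsing and
                // set one output column per extracted field.
                output.Set<string>(0, line);
                yield return output.AsReadOnly();
            }
        }
    }
}
```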
Do you know if the PDF format is splittable based on carriage returns or linefeeds, so that split extents can be processed individually, or if the format needs to be completely available?
In the latter case, you can take a look at the example XML and JSON extractors here: https://github.com/Azure/usql/tree/master/Examples/DataFormats
I will upload some more sample extractors over the weekend that operate on splittable data.
Since this is starting to move away from the topic, please open a new issue to make it more discoverable.
from usql.
Is skipFirstNRows available at this time?
from usql.
Just to close this issue (my apologies for the late reply): The feature was released in the August refresh. Release Notes are here.
from usql.
I tried using it and got the error below, any suggestions?
Error
`at token 'skipFirstNrows', line 16
near the ###:
string,
Duration int?,
Urls string,
ClickedUrls string
FROM @in
USING Extractors.Tsv( ### skipFirstNrows:1);
OUTPUT @searchlog
TO @out
USING Outputters.Csv();`
U-SQL code:
`DECLARE @in string = "/Samples/Data/SearchLog.tsv";
DECLARE @out string = "/output/sl.csv";
@searchlog =
EXTRACT UserId int,
Start DateTime,
Region string,
Query string,
Duration int?,
Urls string,
ClickedUrls string
FROM @in
USING Extractors.Tsv(skipFirstNrows:1);
OUTPUT @searchlog
TO @out
USING Outputters.Csv();
`
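One thing worth checking in the failing script (an observation, not a confirmed answer from the thread): U-SQL extractor arguments are C#-style named parameters and are case-sensitive, and earlier in this thread the parameter is spelled skipFirstNRows with a capital R, while the script uses skipFirstNrows. The same EXTRACT with only the casing changed:

```sql
@searchlog =
    EXTRACT UserId int,
            Start DateTime,
            Region string,
            Query string,
            Duration int?,
            Urls string,
            ClickedUrls string
    FROM @in
    USING Extractors.Tsv(skipFirstNRows: 1);
```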
from usql.