Comments (22)
Dear Andre… not yet. The code is undergoing internal testing and will probably make it into the May release.
Is skipFirstNRows available at this time?
from usql.
Hi Hugh. We are working on adding this capability. We will provide an argument to the built-in extractors Csv(), Tsv() and Text() called skipFirstNRows. Example usage is
EXTRACT ... FROM ... USING Extractors.Csv(skipFirstNRows:2);
Note that it can only skip rows in the first extent of the file. Thus, if your rows are very large or you need to skip many rows, you may get a runtime error if the first extent does not contain all the rows to be skipped.
Please let me know if this will address your requirement.
from usql.
Hi Mike,
The skipFirstNRows argument looks like it would address my requirement. Can you clarify whether it skips the first N rows of each file, or the first N rows of the first file/total input?
I'm not familiar with the concept of the first extent of a file in this context. Typically header rows contain a relatively small amount of data, I wouldn't require the ability to skip arbitrarily large sections of the input files.
from usql.
It would skip the first N rows of each file in a file set.
The caveat about extents is that, in order to parallelize extraction, files get split into extents. At the moment, each parallel execution can see its own extent and the next one. So if you want to skip more rows than the execution context that operates on the first extent can see, it will error.
from usql.
Skipping the first N rows is perfect for my use case.
Thanks for explaining the extent. Without knowing exactly what the size of an extent will be, I can't say for sure, but I don't expect this to be an issue. The largest header row in my use case would be around 5kB.
from usql.
In addition to this, I often see files that have some 'preamble', i.e. they contain useful data in a header:
Number of rows, date time of extraction etc..
It would be great to be able to process these directly without having to jump through too many hoops (i.e. without processing the file twice).
from usql.
@hughwasos : the extent size is somewhat determined by how you uploaded the file. If you created it with append calls, it can be smaller (there is a minimum extent size, but I am not sure what it is at the moment). Normally an extent size is 250MB.
@davebally : Hi Dave, the built-in extractors only give you the option to parse all lines as a single column and then either split the rows using U-SQL statements or skip over them. If you want to extract such heading information as part of each row, you will have to write your own custom extractor.
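A rough sketch of that single-column option (the file name, preamble marker and comma split are made up for illustration; not compiled or executed):

```sql
// Read each line as a single string column by using a column
// delimiter that is assumed not to occur in the data.
@lines =
    EXTRACT line string
    FROM "/mydata/filewithpreamble.txt"
    USING Extractors.Text(delimiter: '\u0001');

// Drop preamble lines by a known marker, then split the
// remaining lines into columns with C# string expressions.
@rows =
    SELECT line.Split(',')[0] AS col1,
           line.Split(',')[1] AS col2
    FROM @lines
    WHERE !line.StartsWith("#");
```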
Note that adding the header information into every row means that you either have to process the file atomically (one file per node) or fix up the rows produced by the nodes that operate on later parts of the file afterwards in U-SQL.
We have a feature on our backlog that will give you the option of returning more than one rowset that could be useful in this scenario, but currently that is pretty low in priority.
If you want to request a functionality to address the scenario beyond what you currently can do, please head to http://aka.ms/adlfeedback and file a request.
Thanks!
from usql.
I just tried using Extractors.Csv(skipFirstNRows:2) and received the following error:
ERRORID: E_CSC_USER_UNSUPPORTEDEXTRACTORSKIPFIRSTROWS
SEVERITY: Error
COMPONENT: CSC
SOURCE: USER
MESSAGE: Unsuppoted extractor parameter: skipFirstNRows.
DETAILS: N/A
DESCRIPTION: skipFirstNRows is not supported in current version.
RESOLUTION: Please remove skipFirstNRows parameter from the extractor parameters.
HELPLINK: N/A
FILEPATH: N/A
LINENUMBER: -1
STARTOFFSET: -1
ENDOFFSET: -1
Is this feature temporarily disabled?
from usql.
@RyanMathewson : The feature is currently disabled since we need to put some lower-level infrastructure in place (which will also allow us to operate more efficiently on files that are not row-oriented). Current ETA for completing that work and re-enabling the skipFirstNRows parameter is March 2016.
Current work-arounds include:
- Use silent:true. That will skip lines that do not fit the schema (too many or too few columns, or values that cannot be cast to the desired type). However, it will skip all such rows.
- EXTRACT into string columns and then use known strings or TryParse to filter the rows that you want/don't want. E.g. (message code, not compiled or executed):
REFERENCE TryParseAssembly;
@input = EXTRACT col1 string, col2 string, col3 string
FROM "filewithheader.csv" USING Extractors.Csv();
@rows = SELECT long.Parse(col1) AS col1, col2, DateTime.Parse(col3) AS col3
FROM @input WHERE MyNS.MyClass.TryParseAsInt(col1);
...Where the TryParseAssembly contains wrappers for TryParse calls such as TryParseAsInt.
...Or if the header prefix is known:
REFERENCE TryParseAssembly;
@input = EXTRACT col1 string, col2 string, col3 string
FROM "filewithheader.csv" USING Extractors.Csv();
@rows = SELECT long.Parse(col1) AS col1, col2, DateTime.Parse(col3) AS col3
FROM @input WHERE !col1.StartsWith("hdr");
- Write your own custom extractor where you handle the header lines. Note that at this point, you will need to do it with atomicFileProcessing set to true. For an example of how to specify that, look at our XML and JSON extractor samples.
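The TryParseAssembly referenced in the first example above is not spelled out in the thread; a minimal sketch of such a wrapper, using the namespace, class and method names from the example (C# code-behind, compiled and registered as an assembly), could look like:

```csharp
using System;

namespace MyNS
{
    public static class MyClass
    {
        // Wraps int.TryParse so it can be called as a boolean
        // predicate in a U-SQL WHERE clause.
        public static bool TryParseAsInt(string s)
        {
            int dummy;
            return int.TryParse(s, out dummy);
        }
    }
}
```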
from usql.
Can I select only 2 columns out of 10 for processing?
from usql.
@dhrubajyoti36 : You can write a custom extractor that does that. Or you can do the EXTRACT of the full schema and then do a SELECT of only the two columns you are interested in. Note that for textual formats like Csv you have to read the data anyway...
from usql.
Thank you Mike.
Could you please give me a sample custom script for that?
Another thing: I am not able to use an inner query or subquery to calculate a percentage (if it is possible without too many joins), can you give an example?
from usql.
@dhrubajyoti36 : Can you open a new issue and paste your non working inner query/subquery? That makes it easier to find as an issue and I could tell you how to fix the query :).
As to a sample custom script:
Here is one that shows you how to do the EXTRACT and SELECT. Note that the EXTRACT will still parse the data but the SELECT then will reduce the rowset for the next steps.
@input = EXTRACT col1 int, col2 string, col3 DateTime, col4 string
FROM "/mydirectory/myfile.csv"
USING Extractors.Csv();
@projinp = SELECT col1, col3 FROM @input;
// now operate on @projinp that has the columns pruned.
Writing a custom extractor is a bit more complex and needs some more time from my side. But basically, since you know how to parse the file, and the UDO model will provide you the requested schema from the EXTRACT clause, you can write the custom extractor so that only the requested columns are being provided.
We have some sample extractors here in the GitHub repo and I will add more over the next couple of weeks.
from usql.
Hi Mike ,
Thank you for the script. Actually, I asked for a sample of a customized extract script so that I can build some extract scripts which I use in several cases.
As per your script:
"EXTRACT col1 int, col2 string, col3 DateTime,"
col3 is declared as DateTime.
When I try to do that in my sample, it throws the error "Problem while converting input record".
I thought it was a problem with the file format, because when I define the column as string it will run, but if I define it as DateTime it will not work.
My sample file looks like:
123,25/01/2016
DECLARE @in string = "/Input/newfile.csv";
DECLARE @out string = "/Output/test_crm2.csv";
@querylog = EXTRACT
    CustomerID int,
    emailsreceivedOn DateTime
FROM @in
USING Extractors.Csv();
@result = SELECT CustomerID, emailsreceivedOn FROM @querylog;
OUTPUT @result TO @out USING Outputters.Csv();
The error :
"diagnosticCode": 195887125,
"severity": "Error",
"component": "RUNTIME",
"source": "User",
"errorId": "E_RUNTIME_USER_EXTRACT_COLUMN_CONVERSION_UNDEFINED_ERROR",
"message": "Failure when attempting to convert column data.",
"description": "2232352F30312F3230313622: Problem while converting input record. \nPosition: line 1, column 2.",
"resolution": "Check the input for errors or use "silent" switch to ignore over(under)-sized rows in the input.\nConsider that ignoring "invalid" rows may influence job results and that types have to be nullable for conversion errors to be ignored.",
"helpLink": "",
"details": "============================================================================================\nHEX:3132332C32352F30312F32303136\n ^\nTEXT:123,25/01/2016\n ^\n\n============================================================================================\n"
from usql.
You mention creating a custom extractor above. Could you provide more info on creating a custom extractor for a PDF file? The PDF would be a structured PDF with no pics, just straight data.
from usql.
@dhrubajyoti36 : based on the details showing the data it tries to convert, it seems that the value is 25/01/2016, which is not recognized by the DateTime conversion using the default locale. You either need to provide the date in MM/DD/YYYY or YYYY-MM-DD format, or extract it as a string and then process it with the locale that you want to support. For example:
DECLARE @in string = "/Temp/test.csv";
DECLARE @out string = "/Output/test_crm2.csv";
@querylog = EXTRACT
CustomerID int,
emailsreceivedOn string
FROM @in
USING Extractors.Csv();
@result = SELECT CustomerID
, DateTime.ParseExact(emailsreceivedOn,"dd/MM/yyyy",System.Globalization.CultureInfo.InvariantCulture) AS emailsreceivedOn
FROM @querylog;
OUTPUT @result TO @out USING Outputters.Csv();
from usql.
@aisley : I am unfortunately not a PDF format expert, so I cannot be of help. You would either write a custom extractor that parses the PDF format and extracts your data, or you write a custom extractor that calls something that knows how to extract the data from your PDF.
from usql.
I can handle the pdf reader/extractor. Just need instruction on getting started with a custom extractor. How to build/install. Any help would be great.
from usql.
Understood.
You basically implement the IExtractor contract in an assembly that you register.
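For reference, a bare-bones sketch of that contract (the types come from the Microsoft.Analytics.Interfaces namespace; the body just echoes text lines and is a placeholder for real PDF parsing):

```csharp
using System.Collections.Generic;
using System.IO;
using Microsoft.Analytics.Interfaces;

// AtomicFileProcessing = true sends the whole file to one vertex,
// which a non-splittable format like PDF typically requires.
[SqlUserDefinedExtractor(AtomicFileProcessing = true)]
public class MyPdfExtractor : IExtractor
{
    public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
    {
        using (var reader = new StreamReader(input.BaseStream))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // Placeholder: replace with actual PDF parsing and
                // set one output column per extracted field.
                output.Set<string>(0, line);
                yield return output.AsReadOnly();
            }
        }
    }
}
```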
Do you know if the PDF format is splittable based on carriage returns or linefeeds, so that split extents can be processed individually, or if the format needs to be completely available?
In the latter case, you can take a look at the example XML and JSON extractors here: https://github.com/Azure/usql/tree/master/Examples/DataFormats
I will upload some more sample extractors over the weekend that operate on splittable data.
Since this is starting to move away from the topic, please open a new issue to make it more discoverable.
from usql.
Is skipFirstNRows available at this time?
from usql.
Just to close this issue (my apologies for the late reply): The feature was released in the August refresh. Release Notes are here.
from usql.
I tried using it and got the error below, any suggestions?
Error
`at token 'skipFirstNrows', line 16
near the ###:
string,
Duration int?,
Urls string,
ClickedUrls string
FROM @in
USING Extractors.Tsv( ### skipFirstNrows:1);
OUTPUT @searchlog
TO @out
USING Outputters.Csv();`
U-SQL code:
`DECLARE @in string = "/Samples/Data/SearchLog.tsv";
DECLARE @out string = "/output/sl.csv";
@searchlog =
EXTRACT UserId int,
Start DateTime,
Region string,
Query string,
Duration int?,
Urls string,
ClickedUrls string
FROM @in
USING Extractors.Tsv(skipFirstNrows:1);
OUTPUT @searchlog
TO @out
USING Outputters.Csv();
`
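One thing worth checking in the failing script (an observation, not a confirmed answer from the thread): U-SQL extractor arguments are C#-style named parameters and are case-sensitive, and earlier in this thread the parameter is spelled skipFirstNRows with a capital R, while the script uses skipFirstNrows. The same EXTRACT with only the casing changed:

```sql
@searchlog =
    EXTRACT UserId int,
            Start DateTime,
            Region string,
            Query string,
            Duration int?,
            Urls string,
            ClickedUrls string
    FROM @in
    USING Extractors.Tsv(skipFirstNRows: 1);
```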
from usql.