cemsbv / pygef
Parse soil measurement data.
Home Page: https://cemsbv.github.io/pygef
License: MIT License
In case of a df.lazy().single_op().collect(), you can just write df.single_op().
Originally posted by @ritchie46 in #71 (comment)
Some columns are derived from other columns, much like the property decorator in Python classes. Reduce redundant data by:
self._df = <dataframe w/ base columns>

@property
def df(self):
    # chain the assigns so derived_a is not discarded by the second call
    df = self._df.assign(derived_a=self._df["a"] + 2)
    df = df.assign(derived_b=df["b"] + 2)
    return df
The following code produces an empty DataFrame:
import os
from pygef import Cpt
path_cpt = os.path.join(os.environ.get("DOC_PATH"), "../pygef/test_files/cpt.gef")
cpt = Cpt(path_cpt)
cpt.classify(classification="robertson", do_grouping=True, min_thickness=0.2, water_level_NAP=-10)
There's no API reference documentation for the CPTData object.
The regex string #ZID[=\s+]+[^,]*[,\s+]+([^,]+) in pygef.utils (line 127) is not working as expected.
Insert the following text in https://regex101.com/. Using #ZID[=\s+]+[^,]*[,\s+]+([^,]+), the zid will not be parsed correctly.
#TESTID = B38C2094
#XYID = 31000,108025,432470
#ZID = 31000,-1.5
#MEASUREMENTTEXT = 9, maaiveld, vast horizontaal niveau
Use #ZID[=\s+]+[^,]*[,\s+]+([^?!,$|\s$]+) instead.
The cpt-data is always assumed to have a "!" value for the record separator in _GefCpt.parse_data(), which is not desired.
Parse the #RECORDSEPARATOR header and use it for splitting the cpt-data records.
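The suggested change could look like this minimal sketch. Note that get_record_separator and split_records are hypothetical helpers, not pygef functions; the fallback to "!" preserves the current behaviour when the header is absent.

```python
import re

def get_record_separator(header_text: str, default: str = "!") -> str:
    """Return the value of the #RECORDSEPARATOR header, or a default."""
    match = re.search(r"#RECORDSEPARATOR\s*=\s*(\S+)", header_text)
    return match.group(1) if match else default

def split_records(data_text: str, separator: str) -> list:
    """Split the data block on the record separator, dropping empty trailers."""
    return [rec for rec in data_text.split(separator) if rec.strip()]

headers = "#COLUMNSEPARATOR= ;\n#RECORDSEPARATOR= !\n#EOH=\n"
sep = get_record_separator(headers)                    # "!"
records = split_records("1.0;2.0;!1.5;2.5;!", sep)     # two records
```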
Update the soil distribution based on the CRUX review.
From Polars version 0.12 it's not possible to do the following (where df is a pl.DataFrame):
plt.plot(df["x"], df["y"])
This raises a NotImplementedError.
Suggest the following:
plt.plot(df["x"].to_numpy(), df["y"].to_numpy())
One of the lines to change:
https://github.com/cemsbv/pygef/blob/master/pygef/plot_utils.py#L88
Had an issue where the voids in the inclination column are used to correct for the depth. In combination with a pre-excavation this leads to an incorrect starting depth with respect to the reference level.
EDIT
In more detail: if the first row of the GEF contains a void in the inclination column, with a value such as -9999, then this value is used to correct for the depth. This leads to significant errors in the corrected depth when a large pre-excavated depth is present. The desired solution is to handle voids before the values used to indicate a void can enter any calculation.
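A minimal sketch of the desired order of operations (mask_voids is a hypothetical helper, not pygef's actual void handling): the void marker is replaced before any derived quantity is computed, so a -9999 in the first inclination row can never leak into the depth correction.

```python
def mask_voids(values, void_value):
    """Replace the column's void marker with None before any calculation."""
    return [None if v == void_value else v for v in values]

inclination = [-9999.0, 0.5, 0.7]          # first row holds the void marker
masked = mask_voids(inclination, -9999.0)  # [None, 0.5, 0.7]

# Downstream code can now skip None entries (or interpolate) instead of
# feeding -9999 into the cumulative depth correction.
usable = [v for v in masked if v is not None]
```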
#71 removes the interpolation; add it again. There's a # TODO for it.
Hello,
I have been trying to read a gef file using the following code:
from pygef import Cpt
Cpt(r"..\GO\46358_10.GEF")
Unfortunately I get an error:
thread '<unnamed>' panicked at 'python apply failed: Any(InvalidOperation("abs not supportedd for series of type Float64"))', src\lazy\apply.rs:35:19
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
File "C:\Users\921479\Desktop\GEO\Z002\notebooks\test_gef.py", line 5, in <module>
Cpt(r"..\GO\46358_10.GEF")
File "C:\Users\921479\Desktop\GEO\Z002\notebooks\.venv\lib\site-packages\pygef\cpt.py", line 159, in __init__
parsed = _GefCpt(path)
File "C:\Users\921479\Desktop\GEO\Z002\notebooks\.venv\lib\site-packages\pygef\gef.py", line 276, in __init__
self.parse_data(self._headers, self._data, column_names)
File "C:\Users\921479\Desktop\GEO\Z002\notebooks\.venv\lib\site-packages\polars\lazy\frame.py", line 299, in collect
return pl.eager.frame.wrap_df(ldf.collect())
pyo3_runtime.PanicException: python apply failed: Any(InvalidOperation("abs not supportedd for series of type Float64"))
I think there is a problem in the .collect() statement when the cpt data is converted to a DataFrame. Has anyone come across this issue before? Is there a workaround? Thanks for your consideration.
Hi,
I have several borehole gef files, and in all gefs where the description has a single line (see the attached file below) the code returns:
self._df = PyDataFrame.read_csv(
RuntimeError: Any(NoData("empty csv"))
Can you support this? Below is the gef file I am referring to:
#GEFID= 1, 1, 0
#FILEOWNER= DataWS
#FILEDATE= 2022, 12, 22
#PROJECTID= Lob van Gennep, 2102701 HB, -
#COLUMN= 2
#COLUMNINFO= 1, m, Laag van, 1
#COLUMNINFO= 2, m, Laag tot, 2
#COMPANYID= -, -, 31
#DATAFORMAT= ASCII
#COLUMNSEPARATOR= ;
#COLUMNTEXT= 1
#LASTSCAN= 1
#XYID= 31000, 196276.20, 412672.60, 0.01, 0.01
#ZID= 31000, 13.22, 0.01
#MEASUREMENTCODE= NEN5104, 1, 0, 0, NNI 1989
#MEASUREMENTTEXT= 3, -, plaatsnaam boring
#MEASUREMENTTEXT= 5, 2022-03-02, datum boorbeschrijving
#MEASUREMENTTEXT= 6, Tla, beschrijver lagen
#MEASUREMENTTEXT= 7, 31000, locaal coördinatiesysteem
#MEASUREMENTTEXT= 8, 31000, locaal referentiesysteem
#MEASUREMENTTEXT= 9, maaiveld, vast horizontaal niveau
#MEASUREMENTTEXT= 13, -, boorbedrijf
#MEASUREMENTTEXT= 14, Nee, openbaar
#MEASUREMENTTEXT= 16, 2022-03-02, datum boring
#MEASUREMENTTEXT= 18, Nee, Peilbuis aanwezig
#MEASUREMENTTEXT= 23, Tla, naam boormeester
#MEASUREMENTTEXT= 31, EDM, boormethode1
#MEASUREMENTVAR= 16, 2.500000, m, eind diepte boring
#MEASUREMENTVAR= 31, 2.500000, m, diepte onderkant boortraject1
#SPECIMENTEXT= 11, 1, monstercode monster1
#SPECIMENTEXT= 12, 2022-03-02, datum monster1
#SPECIMENTEXT= 13, 13:18:29, tijd monster1
#SPECIMENTEXT= 14, G, (on)geroerd monster1
#SPECIMENTEXT= 18, 2, monstercode monster2
#SPECIMENTEXT= 19, 2022-03-02, datum monster2
#SPECIMENTEXT= 20, 13:18:29, tijd monster2
#SPECIMENTEXT= 21, G, (on)geroerd monster2
#SPECIMENTEXT= 25, 3, monstercode monster3
#SPECIMENTEXT= 26, 2022-03-02, datum monster3
#SPECIMENTEXT= 27, 13:18:29, tijd monster3
#SPECIMENTEXT= 28, G, (on)geroerd monster3
#SPECIMENVAR= 1, 3.000000, -, aantal monsters
#SPECIMENVAR= 12, 1.000000, m, onderkant monster1
#SPECIMENVAR= 18, 1.000000, m, bovenkant monster2
#SPECIMENVAR= 19, 2.000000, m, onderkant monster2
#SPECIMENVAR= 25, 2.000000, m, bovenkant monster3
#SPECIMENVAR= 26, 2.500000, m, onderkant monster3
#PROCEDURECODE= GEF-BORE-Report, 1, 0, 0, -
#TESTID= 183.HB4
#REPORTCODE= GEF-BORE-Report, 1, 0, 0, -
#RECORDSEPARATOR= !
#OS= DOS
#LANGUAGE= NL
#EOH=
0.0000e+000;2.5000e+000;'Kz3';;'DO BR';;'KHRD';!
Todo: Most of these pipes can likely be done in a single select. We only have to check which ones depend on the result of a previous one; those need to be done in separate select queries.
Originally posted by @ritchie46 in #71 (comment)
Running the following code
def test_plot_classification_grouped(self):
gef = Cpt("./tests/test_files/cpt.gef")
gef.plot(
show=False,
classification="three_type_rule",
do_grouping=True,
min_thickness=0.2,
water_level_NAP=-10,
)
with this cpt throws an error:
File "c:\Users\brein\Documents\Development\Python\pygef\.env\lib\site-packages\matplotlib\axes\_axes.py", line 2381, in bar
bottom = y - height / 2
TypeError: unsupported operand type(s) for /: 'NoneType' and 'int'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "C:\Python310\lib\unittest\case.py", line 59, in testPartExecutor
yield
File "C:\Python310\lib\unittest\case.py", line 591, in run
self._callTestMethod(testMethod)
File "C:\Python310\lib\unittest\case.py", line 549, in _callTestMethod
method()
File "c:\Users\brein\Documents\Development\Python\pygef\tests\test_plot.py", line 45, in test_plot_classification_grouped
gef.plot(
File "c:\Users\brein\Documents\Development\Python\pygef\pygef\cpt.py", line 355, in plot
return plot.plot_cpt(
File "c:\Users\brein\Documents\Development\Python\pygef\pygef\plot_utils.py", line 119, in plot_cpt
fig = add_grouped_classification(
File "c:\Users\brein\Documents\Development\Python\pygef\pygef\plot_utils.py", line 250, in add_grouped_classification
plt.barh(
File "c:\Users\brein\Documents\Development\Python\pygef\.env\lib\site-packages\matplotlib\pyplot.py", line 2403, in barh
return gca().barh(
File "c:\Users\brein\Documents\Development\Python\pygef\.env\lib\site-packages\matplotlib\axes\_axes.py", line 2551, in barh
patches = self.bar(x=left, height=height, width=width, bottom=y,
File "c:\Users\brein\Documents\Development\Python\pygef\.env\lib\site-packages\matplotlib\__init__.py", line 1412, in inner
return func(ax, *map(sanitize_sequence, args), **kwargs)
File "c:\Users\brein\Documents\Development\Python\pygef\.env\lib\site-packages\matplotlib\axes\_axes.py", line 2383, in bar
raise TypeError(f'the dtypes of parameters y ({y.dtype}) '
TypeError: the dtypes of parameters y (object) and height (object) are incompatible
@martinapippi Is the Pygef team planning to release a new version to PyPI following the recent PRs?
There are now CPTs saved in gpkg format. I have a reader where the user provides a bounding box with coordinates (or a polygon) and it reads all CPTs that exist there (and plots to a vtk file, perhaps irrelevant here).
It would be great if you could make your package compatible with this file format. It saves a ton of time compared to loading each xml/gef separately (plus you no longer need an xml reader).
Is this something you can support?
This fill_nan operation will be redundant if we write the functions correctly. Make sure that the computations we do don't generate NaNs.
Originally posted by @ritchie46 in #71 (comment)
If a test ID contains spaces, e.g. #TESTID= CPT 01, then it is parsed as CPT. This becomes problematic when a series of CPTs is enumerated in this way.
The desired solution would be to parse everything on the line behind #TESTID=, so in the example this would result in CPT 01.
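The desired parsing could be sketched as follows (parse_test_id is a hypothetical helper, not the pygef implementation): capture everything after "#TESTID=" up to the end of the line and trim the surrounding whitespace, so IDs with spaces survive.

```python
import re

def parse_test_id(header_line):
    """Return the full test ID behind #TESTID=, or None if absent."""
    match = re.search(r"#TESTID\s*=\s*(.+)", header_line)
    return match.group(1).strip() if match else None

parse_test_id("#TESTID= CPT 01")    # "CPT 01", not "CPT"
parse_test_id("#TESTID= B38C2094")  # IDs without spaces still work
```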
Requirements in setup.py should not be pinned to an exact version, but should rather allow a range of patch versions.
This should all be expressions.
col("fs") / col("qc") * 100.0
pl.lit(0.0).alias("friction_number")
Originally posted by @ritchie46 in #71 (comment)
A contribution guide should be added.
Example: https://github.com/ritchie46/polars/blob/master/CONTRIBUTING.md
A new, empty column is added when the rows in the cpt data end with only the column-separator value (in _GefCpt.parse_data()). The last (redundant) column separator is only removed when it is followed by "!".
Data records should always be stripped of trailing column AND record separators, even if one of them is not present.
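A sketch of the proposed stripping rule (strip_record is a hypothetical helper; in practice the separators come from the file's #COLUMNSEPARATOR and #RECORDSEPARATOR headers): strip a trailing record separator if present, then a trailing column separator if present, so each combination is handled.

```python
def strip_record(record, column_sep=";", record_sep="!"):
    """Strip one trailing record separator and one trailing column separator."""
    record = record.strip()
    if record.endswith(record_sep):
        record = record[: -len(record_sep)]
    if record.endswith(column_sep):
        record = record[: -len(column_sep)]
    return record

strip_record("0.0;2.5;!")  # both separators present
strip_record("0.0;2.5;")   # only the column separator
strip_record("0.0;2.5!")   # only the record separator
```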
Request that a feature be added to extract the cone_id.
Rationale is:
I will make a PR with this feature
Not for this PR per se, but we can later change these kind of function to return the expression needed in the select query.
Originally posted by @ritchie46 in #71 (comment)
This code gives an error. Can you fix this? Or am I doing something wrong? Thank you!
from pygef.gef import ParseGEF
File = ".GEF"
gef = ParseGEF(File)
gef.plot(classification='been_jeffries', water_level_NAP=-1, min_thickness=0.2, show=True)
Currently, the first encountered valid #COLUMNVOID header is used, and its value is replaced with None in all columns. The other #COLUMNVOID values are effectively ignored.
The #COLUMNVOID values are provided for each column separately in the .gef file and should be applied only to the corresponding columns.
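A sketch of the desired per-column behaviour, using plain Python lists rather than pygef's actual DataFrame pipeline (parse_column_voids and apply_voids are hypothetical helpers): every #COLUMNVOID header is collected into a mapping, and each void value is applied only to its own column.

```python
import re

def parse_column_voids(header_text):
    """Map 1-based column index -> void value from all #COLUMNVOID headers."""
    voids = {}
    for col, value in re.findall(r"#COLUMNVOID\s*=\s*(\d+)\s*,\s*([^\s,]+)", header_text):
        voids[int(col)] = float(value)
    return voids

def apply_voids(rows, voids):
    """Replace a cell with None only when it matches its own column's void."""
    return [
        [None if (i + 1) in voids and cell == voids[i + 1] else cell
         for i, cell in enumerate(row)]
        for row in rows
    ]

headers = "#COLUMNVOID= 1, -9999.0\n#COLUMNVOID= 2, 999.9\n"
rows = [[-9999.0, 1.2], [0.5, 999.9]]
apply_voids(rows, parse_column_voids(headers))
# [[None, 1.2], [0.5, None]]
```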
For polars[pyarrow]<0.16.3:
TypeError: with_columns() takes from 1 to 2 positional arguments but 3 were given
Bump polars to 0.16.3.
In pandas 1.4.3, when I classify a CPT, I get an error:
'DataFrame' object does not support 'Series' assignment by index. Use 'DataFrame.with_columns'
The pygef objects currently have coordinates (e.g. x, y and z) but have no proper universal Coordinate Reference System (CRS) definition. Only the "height_system" is a vertical CRS attribute, but the codes are linked to the GEF format, which is not a universal standard.
As a user I would like to be able to access an attribute of a Cpt object with the EPSG codes for both the horizontal (x & y) and vertical (z) coordinates. These could be two attributes, e.g. xy_epsg and z_epsg.
EPSG codes are universally recognized geodetic definitions and have a scope way beyond geotechnical engineering, which makes a spatial object defined with epsg codes easy to work with by anyone.
The few CRS codes that are defined in the GEF format can be mapped to an EPSG code upon parsing, which will make the "height_system" attribute obsolete. See for instance the EPSG of RD and NAP (most commonly used in the Netherlands)
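A minimal sketch of such a mapping. The attribute names and mapping tables below are proposals, not existing pygef API; GEF code 31000 denotes RD for x/y and NAP for z in the examples above, and EPSG:28992 (RD New / Amersfoort) and EPSG:5709 (NAP height) are the commonly used Dutch definitions.

```python
# Proposed translation tables from GEF reference-system codes to EPSG codes.
GEF_XY_TO_EPSG = {31000: 28992}  # horizontal: RD -> EPSG:28992
GEF_Z_TO_EPSG = {31000: 5709}    # vertical: NAP -> EPSG:5709

def to_epsg(gef_code, table):
    """Return the EPSG code for a GEF reference-system code, or None."""
    return table.get(gef_code)

xy_epsg = to_epsg(31000, GEF_XY_TO_EPSG)  # 28992
z_epsg = to_epsg(31000, GEF_Z_TO_EPSG)    # 5709
```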
This function should start with being True, and after a new major version it should be False.
If the user provides a file to read_cpt that is neither in the correct .gef nor .xml format, pygef tries to read it as xml and throws confusing exceptions. Specifically, if a user provides a .gef file with an erroneous format, the user gets BroXMLParser exceptions, which makes no sense.
Expected result
The user should get an insightful exception (e.g. a custom UnknownFileFormatException) that explains that the provided file cannot be parsed as a .gef or .xml.
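A sketch of the proposed behaviour, with stub parsers standing in for pygef's real GEF/XML parsers (parse_gef, parse_xml and the exception classes here are all hypothetical): try GEF first, then XML, and raise one clear exception when both fail.

```python
class ParseGefError(Exception):
    pass

class ParseXmlError(Exception):
    pass

def parse_gef(content):
    # Stub: real parsing is far more involved.
    if not content.lstrip().startswith("#GEFID"):
        raise ParseGefError("not a GEF file")
    return "gef"

def parse_xml(content):
    # Stub: real parsing is far more involved.
    if not content.lstrip().startswith("<"):
        raise ParseXmlError("not an XML file")
    return "xml"

class UnknownFileFormatError(Exception):
    """Raised when a file can be parsed as neither .gef nor .xml."""

def read_cpt(content):
    # Only when both parsers fail do we raise one clear error, instead of
    # leaking a confusing XML-parser exception for a malformed .gef file.
    try:
        return parse_gef(content)
    except ParseGefError:
        pass
    try:
        return parse_xml(content)
    except ParseXmlError:
        raise UnknownFileFormatError(
            "The provided file could not be parsed as a .gef or .xml CPT file."
        ) from None
```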
Once #71 is done, replace the for loops over columns with frame_equal, or even better a (currently non-existing) assert_frame_equals function.
There's still a pandas dependency needed for some pl.from_pandas() and df.to_pandas() calls. When these have been removed, pandas can also be removed from the requirements.
The current format is not serializable, so it can't easily be sent over a network connection. We need to add Cpt.from_json() and Cpt.to_json() methods (same for Borehole).
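A minimal sketch of what such a round-trip could look like, assuming a heavily simplified Cpt with only a few attributes (the real object carries far more state; the method names match the proposal above but do not yet exist):

```python
import json
from dataclasses import dataclass, field

@dataclass
class Cpt:
    test_id: str
    zid: float
    data: dict = field(default_factory=dict)  # column name -> values

    def to_json(self):
        """Serialize the object to a JSON string."""
        return json.dumps({"test_id": self.test_id, "zid": self.zid, "data": self.data})

    @classmethod
    def from_json(cls, s):
        """Reconstruct a Cpt from its JSON representation."""
        return cls(**json.loads(s))

cpt = Cpt("CPT 01", -1.5, {"depth": [0.0, 0.5], "qc": [1.2, 1.4]})
restored = Cpt.from_json(cpt.to_json())  # round-trips to an equal object
```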
Parsing the following cpt gives an error:
/opt/conda/lib/python3.7/site-packages/pygef/cpt.py in __init__(self, path, content)
151 assert content["string"] is not None, "content['string'] must be specified"
152 if content["file_type"] == "gef":
--> 153 parsed = _GefCpt(string=content["string"])
154 elif content["file_type"] == "xml":
155 parsed = _BroXmlCpt(string=content["string"])
/opt/conda/lib/python3.7/site-packages/pygef/gef.py in __init__(self, path, string)
288 calculate_friction_number(column_names),
289 self.calculate_elevation_with_respect_to_nap(
--> 290 self.zid, self.height_system
291 ),
292 ]
/opt/conda/lib/python3.7/site-packages/polars/lazy/frame.py in collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, string_cache, no_optimization)
297 string_cache,
298 )
--> 299 return pl.eager.frame.wrap_df(ldf.collect())
300
301 def fetch(
PanicException: python apply failed: Any(InvalidOperation("abs not supportedd for series of type Float64"))
How to reproduce:
pygef.Cpt(path="./file.gef")
Hi there,
I am a new user of the pygef package.
I successfully plotted a cpt, but when I add "robertson" to the plotting function it returns an error:
cpt = Cpt(path)
a = cpt.plot("robertson")
a.show()
Traceback (most recent call last):
File "C:\Users\amo.IPC\AppData\Local\Programs\Python\Python310\lib\site-packages\win32com\server\policy.py", line 303, in _Invoke_
return self._invoke_(dispid, lcid, wFlags, args)
File "C:\Users\amo.IPC\AppData\Local\Programs\Python\Python310\lib\site-packages\win32com\server\policy.py", line 308, in _invoke_
return S_OK, -1, self._invokeex_(dispid, lcid, wFlags, args, None, None)
File "C:\Users\amo.IPC\AppData\Local\Programs\Python\Python310\lib\site-packages\win32com\server\policy.py", line 637, in _invokeex_
return func(*args)
File "C:\Users\amo.IPC\AppData\Local\Programs\Python\Python310\lib\site-packages\xlwings\server.py", line 235, in CallUDF
res = call_udf(script, fname, args, this_workbook, FromVariant(caller))
File "C:\Users\amo.IPC\AppData\Local\Programs\Python\Python310\lib\site-packages\xlwings\udfs.py", line 539, in call_udf
ret = func(*args)
File "c:\Users\amo.IPC\OneDrive\02_docadmin\Software_and_Spreadsheets\Python\methods\testfile.py", line 11, in cpt
a = cpt.plot("robertson")
File "C:\Users\amo.IPC\AppData\Local\Programs\Python\Python310\lib\site-packages\pygef\cpt.py", line 301, in plot
df = self.classify(
File "C:\Users\amo.IPC\AppData\Local\Programs\Python\Python310\lib\site-packages\pygef\cpt.py", line 208, in classify
df = robertson.classify(
File "C:\Users\amo.IPC\AppData\Local\Programs\Python\Python310\lib\site-packages\pygef\robertson\__init__.py", line 39, in classify
return iterate_robertson(
File "C:\Users\amo.IPC\AppData\Local\Programs\Python\Python310\lib\site-packages\pygef\robertson\util.py", line 96, in iterate_robertson
df["n"] = n
File "C:\Users\amo.IPC\AppData\Local\Programs\Python\Python310\lib\site-packages\polars\internals\dataframe\frame.py", line 1401, in __setitem__
raise TypeError(
TypeError: 'DataFrame' object does not support 'Series' assignment by index. Use 'DataFrame.with_columns'
Any advice?
Requirements in requirements.txt should be pinned to an exact version.
Replace these instances marked by a # TODO:
df["gamma_predict"] = np.tile(1.0, len(df.rows()))
Group classification is missing a lot of unit tests.
The GEF plot based on elevation_with_respect_to_nap is not correct when passing a NAP level.
See GEF file:
test.gef.txt
When plotting with use_offset=True, the invert_yaxis must be turned off.
For matplotlib==3.4.2:
TypeError: __init__() got an unexpected keyword argument 'layout'
For matplotlib==3.5.0:
ValueError: Cannot __getitem__ on Series of dtype: 'Float64' with argument: '(slice(None, None, None), None)' of type: '<class 'tuple'>'.
Bump the matplotlib version to 3.6.0.
There's a very costly transpose operation that should be replaced; there's probably a mistake in the logic that makes the transpose necessary.
../pygef/gef.py:611: RuntimeWarning: invalid value encountered in true_divide
df = df.assign(friction_number=(df["fs"].values / df["qc"].values * 100))
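One way to avoid the warning is to divide only where qc is non-zero. This sketch (friction_number is a hypothetical helper, not pygef's implementation) leaves NaN wherever qc is zero instead of triggering invalid-value division:

```python
import numpy as np

def friction_number(fs, qc):
    """Compute fs/qc*100, dividing only where qc is non-zero (NaN elsewhere)."""
    out = np.full_like(qc, np.nan, dtype=float)
    np.divide(fs, qc, out=out, where=qc != 0.0)
    return out * 100.0

friction_number(np.array([0.02, 0.05]), np.array([1.0, 0.0]))
# first entry is ~2.0, second stays NaN instead of raising a warning
```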