
rafaelcaballero / bigdatapython

Companion material for the book BIG DATA CON PYTHON: Recolección, almacenamiento y procesamiento de datos, by Enrique Martín Martín, Adrián Riesco, and Rafael Caballero, published by RC Libros.

Home Page: https://rclibros.es/producto/big-data-con-python-recoleccion-almacenamiento-y-proceso/

Jupyter Notebook 100.00%

bigdatapython's People

Contributors

ariesco, emartinm, rafaelcaballero


bigdatapython's Issues

Error with pandas when writing an Excel file

Hi! I have the following code:

```python
import pandas
with pandas.ExcelFile('../../data/Cap1/subvenciones.xls') as xl:
    writer = pandas.ExcelWriter('../../data/Cap1/subvenciones_df.xls')
    for nombre in xl.sheet_names:
        df = xl.parse(nombre)
        df.to_excel(writer, nombre)
    writer.save()
```

but I get a rather strange error. Could anyone help me?

```
File c:\Users\PERSONAL\OneDrive\Documentos\PROGRAMACION GITHUB\BIG DATA PYTHON\env\lib\site-packages\pandas\io\excel\_base.py:1136, in ExcelWriter.__new__(cls, path, engine, date_format, datetime_format, mode, storage_options, if_sheet_exists, engine_kwargs)
   1135 try:
-> 1136     engine = config.get_option(f"io.excel.{ext}.writer", silent=True)
   1137     if engine == "auto":

File c:\Users\PERSONAL\OneDrive\Documentos\PROGRAMACION GITHUB\BIG DATA PYTHON\env\lib\site-packages\pandas\_config\config.py:274, in CallableDynamicDoc.__call__(self, *args, **kwds)
    273 def __call__(self, *args, **kwds) -> T:
--> 274     return self.__func__(*args, **kwds)

File c:\Users\PERSONAL\OneDrive\Documentos\PROGRAMACION GITHUB\BIG DATA PYTHON\env\lib\site-packages\pandas\_config\config.py:146, in _get_option(pat, silent)
    145 def _get_option(pat: str, silent: bool = False) -> Any:
--> 146     key = _get_single_key(pat, silent)
    148     # walk the nested dict

File c:\Users\PERSONAL\OneDrive\Documentos\PROGRAMACION GITHUB\BIG DATA PYTHON\env\lib\site-packages\pandas\_config\config.py:132, in _get_single_key(pat, silent)
    131         _warn_if_deprecated(pat)
--> 132     raise OptionError(f"No such keys(s): {repr(pat)}")
    133 if len(keys) > 1:

OptionError: "No such keys(s): 'io.excel.xls.writer'"

The above exception was the direct cause of the following exception:
...
-> 1140         raise ValueError(f"No engine for filetype: '{ext}'") from err
   1142 # for mypy
   1143 assert engine is not None

ValueError: No engine for filetype: 'xls'
```
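For reference, a likely cause and fix: recent pandas releases no longer ship a writer engine for the legacy .xls format (the xlwt engine was removed in pandas 2.0), which is exactly what `ValueError: No engine for filetype: 'xls'` reports, and `ExcelWriter.save()` was removed as well. A minimal sketch of the workaround, assuming `openpyxl` is installed: write to .xlsx instead, and use the writer as a context manager so the file is saved on exit. The file names here are illustrative, not the book's originals:

```python
import pandas as pd

# Build a small source workbook so the sketch is self-contained
# (reading the book's subvenciones.xls would additionally need xlrd).
with pd.ExcelWriter("subvenciones_demo.xlsx", engine="openpyxl") as w:
    pd.DataFrame({"importe": [100, 200]}).to_excel(w, sheet_name="Hoja1", index=False)

# Same loop as in the question, but writing .xlsx, and using the writer
# as a context manager instead of the removed writer.save().
with pd.ExcelFile("subvenciones_demo.xlsx") as xl, \
     pd.ExcelWriter("subvenciones_df.xlsx", engine="openpyxl") as writer:
    for nombre in xl.sheet_names:
        df = xl.parse(nombre)
        df.to_excel(writer, sheet_name=nombre, index=False)
```

The context manager saves and closes the output workbook; no explicit save call is needed.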

Problem with Chapter 6

Hi!

I have been having trouble running part of Chapter 6. Specifically, it fails both when loading all the files into Spark and when loading all the files with the .txt extension, and I cannot pin down the exact cause of the error.

Here is the code I am using:

```python
r = sc.textFile("../../data/Cap6/file.txt")  # load a single file
print(r.collect(), ',\n')
r = sc.textFile("../../data/Cap6/*.txt")     # load every *.txt file in the directory
print(r.collect(), ',\n')
r = sc.textFile("../../data/Cap6/")          # load every file in the directory
print(r.collect(), ',\n')
```

And this is the error I get:

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
Cell In[8], line 4
      2 print(r.collect(), ',\n')
      3 r = sc.textFile("../../data/Cap6/*.txt")     # load every *.txt file in the directory
----> 4 print(r.collect(), ',\n')
      5 r = sc.textFile("../../data/Cap6/")          # load every file in the directory
      6 print(r.collect(), ',\n')

File ~\OneDrive\Documentos\PROGRAMACION GITHUB\BIG DATA PYTHON\env\lib\site-packages\pyspark\rdd.py:1833, in RDD.collect(self)
   1831 with SCCallSiteSync(self.context):
   1832     assert self.ctx._jvm is not None
-> 1833     sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
   1834 return list(_load_from_socket(sock_info, self._jrdd_deserializer))

File ~\OneDrive\Documentos\PROGRAMACION GITHUB\BIG DATA PYTHON\env\lib\site-packages\py4j\java_gateway.py:1322, in JavaMember.__call__(self, *args)
   1316 command = proto.CALL_COMMAND_NAME +\
   1317     self.command_header +\
   1318     args_command +\
   1319     proto.END_COMMAND_PART
   1321 answer = self.gateway_client.send_command(command)
-> 1322 return_value = get_return_value(
   1323     answer, self.gateway_client, self.target_id, self.name)
   1325 for temp_arg in temp_args:
   1326     if hasattr(temp_arg, "_detach"):

File ~\OneDrive\Documentos\PROGRAMACION GITHUB\BIG DATA PYTHON\env\lib\site-packages\py4j\protocol.py:326, in get_return_value(answer, gateway_client, target_id, name)
    324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
    325 if answer[1] == REFERENCE_TYPE:
--> 326     raise Py4JJavaError(
    327         "An error occurred while calling {0}{1}{2}.\n".
    328         format(target_id, ".", name), value)
    329 else:
    330     raise Py4JError(
    331         "An error occurred while calling {0}{1}{2}. Trace:\n{3}\n".
    332         format(target_id, ".", name, value))

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: java.lang.UnsatisfiedLinkError: 'boolean org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(java.lang.String, int)'
	at org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Native Method)
	at org.apache.hadoop.io.nativeio.NativeIO$Windows.access(NativeIO.java:793)
	at org.apache.hadoop.fs.FileUtil.canRead(FileUtil.java:1249)
	at org.apache.hadoop.fs.FileUtil.list(FileUtil.java:1454)
	at org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:601)
	at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1972)
	at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2014)
	at org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:761)
	at org.apache.hadoop.fs.Globber.listStatus(Globber.java:128)
	at org.apache.hadoop.fs.Globber.doGlob(Globber.java:291)
	at org.apache.hadoop.fs.Globber.glob(Globber.java:202)
	at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:2142)
	at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:276)
	at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:244)
	at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:332)
	at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:208)
	at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:294)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:290)
	at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
	at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:294)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:290)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2463)
	at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1049)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:410)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:1048)
	at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:195)
	at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.base/java.lang.Thread.run(Thread.java:842)

If you have any idea what might be causing this problem, I would really appreciate your help!
