brendonh / pyth
Python text markup and conversion
License: MIT License
setup.py shows the current version is 0.6.0, but the __version__ string in __init__.py is still set to 0.5.6. I noticed this when I did a pip install today and tried to track down why I was getting the version from 2010 rather than 2014. (Turns out I wasn't; the version string just needs to be updated.)
If you could please fix this discrepancy, I would appreciate it. Thanks!
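One way to avoid this kind of drift, sketched here as a suggestion rather than as how the project is set up, is to have setup.py read the version out of __init__.py so that it is declared only once. The helper name read_version and the path in the comment are assumptions for illustration.

```python
import re


def read_version(init_source):
    """Extract the __version__ string from the text of an __init__.py file."""
    match = re.search(
        r"^__version__\s*=\s*['\"]([^'\"]+)['\"]", init_source, re.MULTILINE
    )
    if match is None:
        raise ValueError("no __version__ assignment found")
    return match.group(1)


# In setup.py one could then write something like (path is an assumption):
#   version=read_version(open("pyth/__init__.py").read())
example_init = '"""Pyth package."""\n__version__ = "0.6.0"\n'
print(read_version(example_init))
```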
I'm looking for a way to parse string content instead of reading from a file, something like an overload of the Rtf15Reader.read() function that accepts a string as input.
My code is reading the content from a database table and one of the columns contains RTF content which needs to be parsed as plain text.
For example: Column content
{\rtf1\fbidis\ansi\ansicpg1252\deff0\nouicompat\deflang1033{\fonttbl{\f0\fnil\fcharset0 Segoe UI;}{\f1\fnil Segoe UI;}}{\colortbl ;\red0\green0\blue0;}{\*\generator Riched20 15.0.4811}{\*\mmathPr\mwrapIndent1440 }\viewkind4\uc1 \pard\cf1\f0\fs20\lang2057 hi\f1\lang1033\par{\*\lyncflags<rtf=1>}}
Should be parsed as:
hi
Is there a way to achieve it using this utility?
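A possible workaround, offered as a sketch: other reports in this list call Rtf15Reader.read(open('test.rtf', 'rb')), which suggests the reader takes a file-like object. If it only calls read() on that object (an assumption, not verified here), wrapping the column value in an in-memory buffer may be enough. The helper name as_file_like is hypothetical; on Python 2 the equivalent wrapper would be StringIO.StringIO.

```python
import io


def as_file_like(rtf_text, encoding="ascii"):
    """Wrap RTF content held in memory in a binary file-like object."""
    if isinstance(rtf_text, str):
        # RTF is nominally 7-bit; anything else is replaced rather than raising.
        rtf_text = rtf_text.encode(encoding, errors="replace")
    return io.BytesIO(rtf_text)


source = as_file_like(r"{\rtf1\ansi hi\par}")
print(source.read(5))

# Untested assumption: the result could then be passed to the reader, e.g.
#   doc = Rtf15Reader.read(as_file_like(column_value))
```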
See pyth/pyth/plugins/plaintext/writer.py, line 36 in f2a06fc.
The newline argument should be honored when using PlaintextWriter.write.
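To make the request concrete, here is a toy writer (not pyth's actual class; MiniPlaintextWriter is a hypothetical name) that stores the newline argument and uses it instead of a hard-coded "\n" when joining paragraphs.

```python
class MiniPlaintextWriter:
    """Toy stand-in for a plain text writer that honors its newline argument."""

    def __init__(self, newline="\n"):
        self.newline = newline

    def write(self, paragraphs):
        # Every paragraph is terminated with the caller's newline sequence.
        return "".join(paragraph + self.newline for paragraph in paragraphs)


print(repr(MiniPlaintextWriter(newline="\r\n").write(["first", "second"])))
```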
Hi
The latest pyth on PyPI is from Aug. 2010. Since then, commit c444a8d has fixed an issue that has come back to bite me repeatedly. I would appreciate it if you could upload a more recent version of pyth, so that when I install from PyPI, it comes without this issue.
Thanks!
The code currently passes over all non-photo fields, including textboxes with text in them, drop-down menus with items selected, and check blocks that are checked.
I tried to figure out how to handle this within the code's current structure, by incorporating something into the handle_field method, but I couldn't figure it out. So instead I made a pre-processing function that finds those types of fields and replaces each one with whatever text was supposed to be there: the entered text for a textbox, the selected text for a drop-down list (or the default if that's appropriate), and "Yes" or "No" for a checkbox. Then, when you run the result through the converter, it comes out as plain text.
This requires the third-party regex module rather than re, because the patterns use recursive subpatterns such as (?1), which re does not support.
I'm not submitting this as a pull request because I'm not sure where you'd want to include this sort of pre-processing. But if you want to do so, here is the function:
import regex

def flattenrtffields(rawrtf):
    # Get all "fields", including nested ones
    fieldsearch = regex.compile(r"{\\field[^{]*?({(?>[^{}]+|(?1))*})({(?>[^{}]+|(?1))*})}")
    textboxes, drops, checks = [], [], []
    checkboxoptions = ["No", "Yes"]
    # Make lists of the kinds of fields to flatten
    for field in fieldsearch.finditer(rawrtf):
        if "FORMTEXT" in field[0]:
            textboxes.append(field[0])
        elif "FORMDROPDOWN" in field[0]:
            drops.append(field[0])
        elif "FORMCHECKBOX" in field[0]:
            checks.append(field[0])
    # Deal with textboxes: keep the field result text
    for textbox in textboxes:
        try:
            result = regex.search(r"fldrslt ({(?>[^{}]+|(?1))*})}", textbox)[1]
            if result:
                rawrtf = rawrtf.replace(textbox, result)
        except TypeError:  # no match: regex.search returned None
            pass
    # Deal with drop-down lists: keep the selected (or default) item
    for drop in drops:
        try:
            ddresult = regex.search(r"fftype2.*ffres([0-9]*)", drop)[1]
            if ddresult == "25":
                ddresult = regex.search(r"ffdefres([0-9]*)", drop)[1]
            ddlist = regex.findall(r"ffl ([^}]*)}", drop)
            rawrtf = rawrtf.replace(drop, "{\\rtlch " + ddlist[int(ddresult)] + "}")
        except (TypeError, ValueError, IndexError):
            pass
    # Deal with checkboxes: replace with "Yes" or "No"
    for check in checks:
        try:
            result = regex.search(r"fftype1.*ffres([0-9]*)", check)[1]
            if result == "25":
                result = regex.search(r"ffdefres([0-9]*)", check)[1]
            rawrtf = rawrtf.replace(check, "{\\rtlch " + checkboxoptions[int(result)] + "}")
        except (TypeError, ValueError, IndexError):
            pass
    return rawrtf
When I read an RTF and write it out as plain text (both with pyth), all of the hex for embedded images is included in the document. As expected, the \pict control group itself is gone.
At the moment, I'm preprocessing these files to wipe out the pict group (hex included) before using pyth, but, of course, it would be nice to avoid that. I'm not familiar enough with RTF versions to know if this is part of the 1.5 spec or a later one. However, these files run perfectly otherwise.
I can send you an example, if needed.
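For anyone needing the same workaround, the preprocessing step described above might look roughly like this. It is a sketch under assumptions, not the reporter's actual code: strip_pict_groups is a hypothetical helper, it works on the raw RTF as text, and it does not handle raw binary image data introduced by \bin.

```python
def strip_pict_groups(rtf):
    """Remove every {\\pict ...} group, hex data included, from RTF text."""
    marker = "{\\pict"
    out = []
    i = 0
    n = len(rtf)
    while i < n:
        # Only treat "{\pict" as a picture group if the control word ends there.
        is_pict = (rtf.startswith(marker, i)
                   and not rtf[i + len(marker):i + len(marker) + 1].isalpha())
        if not is_pict:
            out.append(rtf[i])
            i += 1
            continue
        # Walk forward to the brace that closes this group.
        depth = 0
        j = i
        while j < n:
            ch = rtf[j]
            if ch == "\\":
                j += 2  # skip the escaped character, e.g. \{ or \}
                continue
            if ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0:
                    break
            j += 1
        i = j + 1  # resume after the closing brace
    return "".join(out)


print(strip_pict_groups(r"{\rtf1 a{\pict\pngblip 89504e47}b}"))
```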
I have OpenOffice RTF files with non-ASCII metadata (author):
{\info{\author Claudia Jürgens}{\creatim\yr2010\mo7\dy19\hr12\min45}{\author Claudia Jürgens}
{\revtim\yr2010\mo7\dy28\hr13\min27}{\printim\yr0\mo0\dy0\hr0\min0}{\comment
StarWriter}{\vern3000}}\deftab709
This causes UnicodeDecodeError:
Module pyth.plugins.rtf15.reader, line 93, in read
Module pyth.plugins.rtf15.reader, line 113, in go
Module pyth.plugins.rtf15.reader, line 147, in parse
UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 0: ordinal not in range(128)
This patch just catches the error:
*** reader.py 2010-05-04 21:48:14.000000000 +0200
--- reader.py 2010-08-04 21:47:10.000000000 +0200
***************
*** 140,146 ****
                  control, digits = self.getControl()
                  self.group.handle(control, digits)
              else:
!                 self.group.char(unicode(next))
      def getControl(self):
--- 140,149 ----
                  control, digits = self.getControl()
                  self.group.handle(control, digits)
              else:
!                 try:
!                     self.group.char(unicode(next))
!                 except UnicodeDecodeError, e:
!                     self.group.char('?')
      def getControl(self):
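As a possible refinement of the patch above, the raw byte could be decoded with a code page instead of being replaced by '?'. The sketch below assumes cp1252, which is only a guess because the header of the sample file is not shown; decode_raw is a hypothetical helper, written for Python 3.

```python
def decode_raw(raw, charset="cp1252"):
    """Decode raw bytes from an RTF stream without raising on bad input."""
    # errors="replace" substitutes U+FFFD for bytes the codec cannot map.
    return raw.decode(charset, errors="replace")


print(decode_raw(b"Claudia J\xfcrgens"))
```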
Parsing a string like \'hello causes an exception because the RTF reader expects the following two characters to be hex digits.
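A defensive check along these lines might avoid the exception. This is a sketch only: parse_ansi_escape is a hypothetical helper, and what the reader should do with an invalid escape (skip it, or keep it as literal text) is left open.

```python
import string


def parse_ansi_escape(two_chars):
    """Return the byte value for the two characters after \\', or None if invalid."""
    if len(two_chars) == 2 and all(c in string.hexdigits for c in two_chars):
        return int(two_chars, 16)
    return None


print(parse_ansi_escape("fc"))  # a valid escape such as \'fc
print(parse_ansi_escape("he"))  # the first two characters of \'hello
```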
Hello!
Could you please help?
I have a file.rtf with a table inside. I need to read the text from the table.
But the program does not insert any separator when moving from one cell to the next within a row.
Example:
Input: Cell 1 : How / Cell 2 : are / Cell 3 : you
Output: Howareyou
My code is:
from pyth.plugins.rtf15.reader import Rtf15Reader
from pyth.plugins.plaintext.writer import PlaintextWriter
doc = Rtf15Reader.read(open('test.rtf', 'rb'))
PlaintextWriter.write(doc).getvalue()
Thank you!
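Until the reader handles tables, one stopgap might be to insert a visible separator before every \cell control word in the raw RTF before handing it to pyth. This is an untested sketch: separate_cells is a hypothetical helper, and it assumes the reader passes the inserted literal text through unchanged.

```python
import re


def separate_cells(rtf, separator=" | "):
    """Insert literal separator text before every \\cell control word."""
    # The lookahead keeps longer control words such as \cellx untouched.
    return re.sub(r"\\cell(?![a-zA-Z])",
                  lambda match: separator + match.group(0),
                  rtf)


print(separate_cells(r"How\cell are\cell you\cell\row"))
```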
When trying to read https://www.gnu.org/licenses/lgpl.rtf I get:
>>> b=Rtf15Reader.read(open('lgpl.rtf', 'rb'))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/pom/tmp/local/lib/python2.7/site-packages/pyth/plugins/rtf15/reader.py", line 86, in read
return reader.go()
File "/home/pom/tmp/local/lib/python2.7/site-packages/pyth/plugins/rtf15/reader.py", line 109, in go
self.parse()
File "/home/pom/tmp/local/lib/python2.7/site-packages/pyth/plugins/rtf15/reader.py", line 143, in parse
self.group.handle(control, digits)
File "/home/pom/tmp/local/lib/python2.7/site-packages/pyth/plugins/rtf15/reader.py", line 402, in handle
handler(digits)
File "/home/pom/tmp/local/lib/python2.7/site-packages/pyth/plugins/rtf15/reader.py", line 521, in handle_ansi_escape
char = chr(code).decode(self.charset, self.reader.errors)
UnicodeDecodeError: 'cp932' codec can't decode byte 0x81 in position 0: incomplete multibyte sequence
I found something new ;) I'll send you a test file.
The XHTML output should be:
Mit fünf Jahren
Instead it shows:
Mit fŸnf Jahren
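One possible explanation, offered as a guess since the test file is not shown here: the symptom matches Mac Roman bytes being decoded as cp1252. In Mac Roman the byte 0x9F is "ü", while in cp1252 the same byte is "Ÿ". The snippet below reproduces the effect with Python's standard codecs.

```python
# Encode the expected word the way a Mac-charset RTF writer might.
raw = "f\u00fcnf".encode("mac_roman")

print(raw)                      # the raw bytes
print(raw.decode("cp1252"))     # decoded with the wrong code page
print(raw.decode("mac_roman"))  # decoded with the matching code page
```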
My RTF file contains literal "}" and "{" characters, but they no longer appear in the generated XHTML.
I'll send you a test file.
Any chance you can cut a release and bump the PyPI version? Thanks!
I'll send you a test file.
Module pyth.plugins.rtf15.reader, line 103, in read
Module pyth.plugins.rtf15.reader, line 124, in go
Module pyth.plugins.rtf15.reader, line 155, in parse
Module pyth.plugins.rtf15.reader, line 385, in char
TypeError: decode() argument 1 must be string, not dict
In setup.py, the home page of this project is set to https://wiki.github.com/brendonh/pyth . This link is broken.
This shows up as a link in the page https://pypi.org/project/pyth/ , clicking on "Homepage" will lead to a 404 error.
I'm using RTF files generated by pandoc. They have a lot of "\f0" control words (no idea why).
/plugins/rtf15/reader.py cannot read these files because of this "\f0" word.
For a general solution, could you skip unknown control words?
Example rtf:
{\rtf\ansi\deff0{\fonttbl{\f0\froman Tms Rmn;}{\f1\fdecor
Symbol;}{\f2\fswiss Helv;}}{\colortbl;\red0\green0\blue0;
\red0\green0\blue255;\red0\green255\blue255;\red0\green255
\blue0;\red255\green0\blue255;\red255\green0\blue0;\red255
\green255\blue0;\red255\green255\blue255;}{\stylesheet{\fs20
\snext0Normal;}}{\info{\author John Doe}
{\creatim\yr1990\mo7\dy30\hr10\min48}{\version1}{\edmins0}
{\nofpages1}{\nofwords0}{\nofchars0}{\vern8351}}\widoctrl\ftnbj \sectd\linex0\endnhere \pard\plain \fs20 This is plain text.\
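The "skip unknown control words" idea could be sketched as below. The tracebacks elsewhere in this list show a handle method that ends in handler(digits) and a handle_ansi_escape method, so the handle_<word> lookup is modelled on that, but TolerantGroup is a toy class and not pyth's implementation. Skipping a control word does not skip the text of an unknown destination group, so this alone may not be a complete fix.

```python
class TolerantGroup:
    """Toy dispatcher: calls handle_<word> if it exists, otherwise skips the word."""

    def __init__(self):
        self.bold = False
        self.skipped = []

    def handle_b(self, digits):
        # \b turns bold on, \b0 turns it off.
        self.bold = digits != "0"

    def handle(self, control, digits):
        handler = getattr(self, "handle_%s" % control, None)
        if handler is None:
            # Unknown control word: remember it and carry on instead of raising.
            self.skipped.append(control)
            return
        handler(digits)


group = TolerantGroup()
group.handle("f", "0")  # unknown here, so it is skipped
group.handle("b", "")
print(group.bold, group.skipped)
```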
I have an RTF file with strange Unicode strings (I'll send you an email).
This causes rtf reader to throw ValueError:
* Module pyth.plugins.rtf15.reader, line 93, in read
* Module pyth.plugins.rtf15.reader, line 113, in go
* Module pyth.plugins.rtf15.reader, line 141, in parse
* Module pyth.plugins.rtf15.reader, line 369, in handle
* Module pyth.plugins.rtf15.reader, line 476, in handle_u
ValueError: unichr() arg not in range(0x10000) (narrow Python build)
The reason is that my Python was built without support for "wide" Unicode characters (http://www.python.org/dev/peps/pep-0261/). However, some exception handling would be nice.
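The requested exception handling might look like the following. It is a sketch written for Python 3, where chr covers the full Unicode range; on a narrow Python 2 build the equivalent unichr call would take the except branch for values of 0x10000 and above. The name safe_unichr and the "?" fallback are assumptions.

```python
def safe_unichr(code, fallback="?"):
    """Convert the numeric argument of an RTF \\uN control word to text."""
    if code < 0:
        # RTF stores \uN as a signed 16-bit number, so large values appear negative.
        code += 65536
    try:
        return chr(code)
    except (ValueError, OverflowError):
        # Out-of-range value: substitute a placeholder instead of failing.
        return fallback


print(safe_unichr(252), safe_unichr(0x110000))
```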
It would be nice if the RTF reader were able to parse \listtable and read \levelnfc (type of list) and possibly \levelstartat (list start value) as documented here. It should also implement list overrides, but just to get the proper reference for \ls elements.
I have a lot of tiny RTF documents, some including bulleted/numbered lists, and I would love to be able to preserve this information and output it as XHTML.
It seems that the newline parameter is not used. I can make a pull request, but this is my first time using this tool, so please confirm that I should just replace "\n" with self.newline.
There also seems to be a bug in the RTF-to-text conversion, between the newline from a paragraph and the newline from the newline parameter, but I need to investigate to see exactly why.
Converting an RTF with images to a plain text file writes the binary data of the image to the plain text file. When I create a plain text file, I expect that the images will be stripped.
Hello guys,
CJK means Chinese, Japanese, and Korean. Many older RTF writers don't store these characters in Unicode, and using pyth to read CJK characters from such documents causes a UnicodeDecodeError, because the CJK codecs actually use 4 hex digits per character, not 2.
I modified plugins/rtf15/reader.py to meet my own needs, but I still hope someone can write better code to deal with this issue.
1) Add this import first:
from binascii import unhexlify
2) Add code page 936 to the list:
# All the ones named by number in my 2.6 encodings dir
_CODEPAGES_BY_NUMBER = dict(
    (x, "cp%s" % x) for x in (37, 1006, 1026, 1140, 1250, 1251, 1252, 1253, 1254, 1255,
                              1256, 1257, 1258, 424, 437, 500, 737, 775, 850, 852, 855,
                              856, 857, 860, 861, 862, 863, 864, 865, 866, 869, 874,
                              875, 932, 936, 949, 950))
3) Change the default to 'ignore':
    def read(self, source, errors='ignore'):
4) Read two more hex digits for the double-byte code pages, and decode both bytes together in handle_ansi_escape:
            if next == "'":
                # ANSI escape, takes two hex digits
                chars.extend("ansi_escape")
                digits.extend(self.source.read(2))
                # Some Asian code pages take two more digits: Japanese (cp932),
                # Simplified Chinese (cp936), Korean (cp949), Traditional Chinese (cp950).
                # Caveat: the two characters read here are lost if they are not \'
                if self.charset in ("cp932", "cp936", "cp949", "cp950"):
                    if self.source.read(2) == "\\'":
                        digits.extend(self.source.read(2))
                break

    def handle_ansi_escape(self, code):
        cjk = code
        code = int(code, 16)
        if isinstance(self.charset, dict):
            uni_code = self.charset.get(code)
            if uni_code is None:
                char = u'?'
            else:
                char = unichr(uni_code)
        elif code <= 255:
            char = chr(code).decode(self.charset, self.reader.errors)
        else:
            # Four hex digits: decode the two bytes together
            char = unhexlify(cjk).decode(self.charset, self.reader.errors)
        self.content.append(char)
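An alternative that avoids listing code pages by hand, sketched here for Python 3 and not taken from pyth: collect each run of consecutive \'xx escapes into one byte string and let the codec decide how many bytes form a character. The helper name decode_escape_runs is hypothetical. It would still miss double-byte characters whose second byte is written as a plain ASCII character instead of an escape, which some writers may do.

```python
import re

# One or more consecutive \'xx escapes.
_ESCAPE_RUN = re.compile(r"(?:\\'[0-9a-fA-F]{2})+")


def decode_escape_runs(text, charset):
    """Decode runs of \\'xx escapes as one byte string in the given charset."""
    def _decode(match):
        raw = bytes.fromhex(match.group(0).replace("\\'", ""))
        return raw.decode(charset, errors="replace")
    return _ESCAPE_RUN.sub(_decode, text)


# Build a sample escape sequence from a known character instead of hard-coding it.
sample = "".join("\\'%02x" % byte for byte in "\u3042".encode("cp932"))
print(sample, "->", decode_escape_runs(sample, "cp932"))
```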
Hi, I've just cloned the repo and tried to run the RTF reading example. Error:
Traceback (most recent call last):
File "D:\dev\soft\pyth\rtf15.py", line 11, in <module>
doc = Rtf15Reader.read(open(filename))
File "D:\dev\soft\pyth\pyth\plugins\rtf15\reader.py", line 103, in read
return reader.go()
File "D:\dev\soft\pyth\pyth\plugins\rtf15\reader.py", line 124, in go
self.parse()
File "D:\dev\soft\pyth\pyth\plugins\rtf15\reader.py", line 143, in parse
self.group = self.stack[-1]
IndexError: list index out of range
Currently there is no support for reading a table from RTF. This should be added to meet the design goals.
unrtf has several interesting RTF test files, some of which fail to open with pyth:
http://ftp.gnu.org/gnu/unrtf/?C=M;O=D
They could be a good addition to the test suite.
It would be nice if pyth would support Python 3. Is that support on the roadmap?