
Error reading a docx file in Python

I am trying to get the text out of a .docx file with the code below, but the text contains special characters (such as "ç" or "á") and the code does not seem to read the file correctly.

"""
Module that extracts text from an MS XML Word document (.docx).
(Inspired by python-docx)
"""

try:
    from xml.etree.cElementTree import XML
except ImportError:
    from xml.etree.ElementTree import XML
import zipfile

# WordprocessingML namespace, written in ElementTree's {uri}tag notation
WORD_NAMESPACE = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
PARA = WORD_NAMESPACE + 'p'
TEXT = WORD_NAMESPACE + 't'

def get_docx_text(path):
    """
    Take the path of a docx file as argument, return the text in unicode.
    """
    # A .docx file is a zip archive; the body lives in word/document.xml
    document = zipfile.ZipFile(path)
    xml_content = document.read('word/document.xml')
    document.close()
    tree = XML(xml_content)

    paragraphs = []
    for paragraph in tree.iter(PARA):
        # Collect the non-empty text runs of this paragraph
        texts = [node.text
                 for node in paragraph.iter(TEXT)
                 if node.text]
        if texts:
            paragraphs.append(''.join(texts))

    return '\n\n'.join(paragraphs)

if __name__ == '__main__':
    doc = get_docx_text('teste.docx')
    # Python 2 print statement
    print doc.split('\n')

In this short example, the original text is:

^{pr2}$

But what I get instead is: 01 A titula\xe7\xe3o gen\xe9rica de Administra\xe7\xe3o
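
One possible reading of this output, offered as an assumption rather than a confirmed diagnosis: the file may have been read correctly, and the escapes appear only because a list is being printed. Printing a list shows the escaped representation of each element instead of the text itself. The sketch below is written for Python 3, where `ascii()` stands in for the way Python 2 displays non-ASCII characters inside a printed list. The variable names are illustrative.

```python
# The string exactly as it appears in the output above. In a Python 3
# literal, '\xe7' is simply the character U+00E7 ("ç").
line = '01 A titula\xe7\xe3o gen\xe9rica de Administra\xe7\xe3o'

# What a list looks like when non-ASCII characters are shown as escapes
# (assumed to be what Python 2 does when a list of unicode strings is printed).
escaped = ascii([line])

# Joining the list back into one string keeps the real characters.
readable = '\n'.join([line])

print(escaped)
print(readable)
```

If this reading is right, printing each line separately (or joining the lines before printing) shows the accented characters, with no change to `get_docx_text` itself.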
