[![Actions Status](https://github.com/kaz-utashiro/optex-textconv/workflows/test/badge.svg)](https://github.com/kaz-utashiro/optex-textconv/actions) [![MetaCPAN Release](https://badge.fury.io/pl/App-optex-textconv.svg)](https://metacpan.org/release/App-optex-textconv)

# NAME

textconv - optex module to replace document file by its text contents

# VERSION

Version 1.01

# SYNOPSIS

    optex command -Mtextconv
    optex command -Mtc (alias module)
    optex command -Mtextconv::load=pandoc

# DESCRIPTION

This module replaces file names of several document types with a node representing their text content. The files themselves are not altered.

For example, you can check the text difference between MS Word files like this:

    $ optex diff -Mtextconv OLD.docx NEW.docx

If you have a symbolic link named **diff** pointing to **optex**, and the following settings in your `~/.optex.d/diff.rc`:

    option default --textconv
    option --textconv -Mtextconv $<move>

then the next command produces the same result.

    $ diff OLD.docx NEW.docx

## FILE FORMATS

- msdoc

    Microsoft Office format files in XML (.docx, .pptx, .xlsx, .docm, .pptm, .xlsm). Uses [App::optex::textconv::msdoc](https://metacpan.org/pod/App::optex::textconv::msdoc), [App::optex::textconv::ooxml](https://metacpan.org/pod/App::optex::textconv::ooxml), [App::optex::textconv::ooxml::regex](https://metacpan.org/pod/App::optex::textconv::ooxml::regex), [App::optex::textconv::ooxml::xslt](https://metacpan.org/pod/App::optex::textconv::ooxml::xslt).

- doc

    Microsoft Word file. Uses the [Text::Extract::Word](https://metacpan.org/pod/Text::Extract::Word) module.

- xls

    Microsoft Excel file. Uses the [Spreadsheet::ParseExcel](https://metacpan.org/pod/Spreadsheet::ParseExcel) module.

- pdf

    Uses the [pdftotext(1)](http://man.he.net/man1/pdftotext) command to convert PDF format. See [App::optex::textconv::pdf](https://metacpan.org/pod/App::optex::textconv::pdf).

- jpeg

    JPEG files are converted to their Exif information (.jpeg, .jpg).
- http

    A name starting with `http://` or `https://` is converted to text data translated by the [w3c(1)](http://man.he.net/man1/w3c) command.

- pandoc

    Uses the [pandoc](https://pandoc.org/) command to translate Microsoft Office documents in XML format. See [App::optex::textconv::pandoc](https://metacpan.org/pod/App::optex::textconv::pandoc).

- tika

    Uses the [Apache Tika](https://tika.apache.org/) command to translate Microsoft Office documents in XML and non-XML formats. See [App::optex::textconv::tika](https://metacpan.org/pod/App::optex::textconv::tika).

# MICROSOFT DOCUMENTS

Microsoft Office documents in XML format (.docx, .pptx, .xlsx) are converted to plain text by original code implemented in the [App::optex::textconv::msdoc](https://metacpan.org/pod/App::optex::textconv::msdoc) module. The algorithm used in this module is extremely simple, and consequently runs fast.

Two modules are included in this distribution to use external converter programs, **pandoc** and **tika**, which implement much more elaborate algorithms. They can be invoked by calling the **load** function with a module declaration like:

    optex -Mtextconv::load=pandoc
    optex -Mtextconv::load=tika

# INSTALL

## CPANM

    $ cpanm App::optex::textconv
    or
    $ curl -sL http://cpanmin.us | perl - App::optex::textconv

## GIT

These are sample configurations for using [App::optex::textconv](https://metacpan.org/pod/App::optex::textconv) in a git environment.

    ~/.gitconfig
        [diff "msdoc"]
            textconv = optex -Mtextconv cat
        [diff "pdf"]
            textconv = optex -Mtextconv cat
        [diff "jpg"]
            textconv = optex -Mtextconv cat

    ~/.config/git/attributes
        *.docx   diff=msdoc
        *.pptx   diff=msdoc
        *.xlsx   diff=msdoc
        *.pdf    diff=pdf
        *.jpg    diff=jpg

For other git-related settings, see [https://github.com/kaz-utashiro/sdif-tools](https://github.com/kaz-utashiro/sdif-tools).
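
The attributes sample covers only a few extensions, while the FILE FORMATS section also names .docm, .pptm, .xlsm and .jpeg. As a minimal sketch, the helper below prints attribute lines for any set of extensions. The function name `attributes_for` is hypothetical (it is not part of optex or git), and the driver names assume the `[diff "..."]` sections from the sample `~/.gitconfig`.

```shell
#!/bin/sh
# Hypothetical helper (not part of optex or git): print one
# gitattributes line per extension for the given diff driver.
attributes_for() {
    driver=$1
    shift
    for ext in "$@"; do
        printf '*.%s diff=%s\n' "$ext" "$driver"
    done
}

# Extensions are taken from the FILE FORMATS section; driver names
# match the [diff "..."] sections in the sample ~/.gitconfig.
attributes_for msdoc docx pptx xlsx docm pptm xlsm
attributes_for pdf pdf
attributes_for jpg jpg jpeg
```

Redirecting the output to `~/.config/git/attributes` (or appending with `>>`) should give the same kind of file as the sample, extended to the macro-enabled and `.jpeg` variants.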

# SEE ALSO

[https://github.com/kaz-utashiro/optex](https://github.com/kaz-utashiro/optex)

[https://github.com/kaz-utashiro/optex-textconv](https://github.com/kaz-utashiro/optex-textconv)

[https://qiita.com/kaz-utashiro/items/23fd825bd325240592c2](https://qiita.com/kaz-utashiro/items/23fd825bd325240592c2)

[https://github.com/kaz-utashiro/sdif-tools](https://github.com/kaz-utashiro/sdif-tools)

# AUTHOR

Kazumasa Utashiro

# LICENSE

Copyright 2019-2022 Kazumasa Utashiro.

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.