Data::Importers


In brief
This repository hosts a Raku package for the import and export of different types of data
from both URLs and files. The data type is automatically deduced from file and URL extensions.
Remark: The built-in subs slurp and spurt are overloaded by the definitions of this package.
The corresponding functions data-import and data-export can also be used.
The format of the data from the URLs or files can be specified with the named argument "format".
If format => Whatever, then the format of the data is deduced from the extension of the given URL or file name.
The currently recognized formats are: CSV, HTML, JSON, Image (png, jpeg, jpg), PDF, Plaintext, Text, XML.
The subs slurp and data-import, as well as spurt and data-export,
can work with CSV & TSV files if "Text::CSV", [HMBp1], is installed.
The subs slurp and data-import can also work with PDF files if "PDF::Extract", [SRp1], is installed.
Remark: Since "Text::CSV" is a "heavy" package to install, it is not included in the dependencies of this one.
Remark: Similarly, "PDF::Extract" requires an additional, non-Raku installation, and it currently targets only macOS.
That is why it is not included in the dependencies of "Data::Importers".
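Exporting can be sketched as follows. This is a minimal, hedged example: it assumes that the overloaded spurt and the function data-export mirror the built-in spurt signature (file name first, then the data), and the file name used is hypothetical:

```raku
use Data::Importers;

# A hash to export; the target file name is hypothetical
my %record = name => 'ingrid', value => 1;

# Format deduced from the '.json' extension (assumed behavior)
spurt($*CWD ~ '/record.json', %record);

# Equivalent call with the format given explicitly
data-export($*CWD ~ '/record.json', %record, format => 'json');
```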
Installation
From the Zef ecosystem:
zef install Data::Importers
From GitHub:
zef install https://github.com/antononcube/Raku-Data-Importers.git
File examples
In order to use the slurp definitions of this package, the named argument "format" has to be specified:
JSON file
use Data::Importers;
slurp($*CWD ~ '/resources/simple.json', format => 'json')
# {name => ingrid, value => 1}
Instead of slurp, the function data-import can be used (no need to specify "format"):
data-import($*CWD ~ '/resources/simple.json')
# {name => ingrid, value => 1}
CSV file
slurp($*CWD ~ '/resources/simple.csv', format => 'csv', headers => 'auto')
# [{X1 => 1, X2 => A, X3 => Cold} {X1 => 2, X2 => B, X3 => Warm} {X1 => 3, X2 => C, X3 => Hot}]
URLs examples
JSON URLs
Import a JSON file:
my $url = 'https://raw.githubusercontent.com/antononcube/Raku-LLM-Prompts/main/resources/prompt-stencil.json';
my $res = data-import($url, format => Whatever);
$res.WHAT;
# (Hash)
Here is the deduced type:
use Data::TypeSystem;
deduce-type($res);
# Struct([Arity, Categories, ContributedBy, Description, Keywords, Name, NamedArguments, PositionalArguments, PromptText, Topics, URL], [Int, Hash, Str, Str, Array, Str, Array, Hash, Str, Hash, Str])
Using slurp instead of data-import:
slurp($url)
# {Arity => 1, Categories => {Function Prompts => False, Modifier Prompts => False, Personas => False}, ContributedBy => Anton Antonov, Description => Write me!, Keywords => [], Name => Write me!, NamedArguments => [], PositionalArguments => {$a => VAL}, PromptText => -> $a='VAL' {"Something over $a."}, Topics => {AI Guidance => False, Advisor Bots => False, Character Types => False, Chats => False, Computable Output => False, Content Derived from Text => False, Education => False, Entertainment => False, Fictional Characters => False, For Fun => False, General Text Manipulation => False, Historical Figures => False, Linguistics => False, Output Formatting => False, Personalization => False, Prompt Engineering => False, Purpose Based => False, Real-World Actions => False, Roles => False, Special-Purpose Text Manipulation => False, Text Analysis => False, Text Generation => False, Text Styling => False, Wolfram Language => False, Writers => False, Writing Genres => False}, URL => None}
Image URL
Import an image:
my $imgURL = 'https://raw.githubusercontent.com/antononcube/Raku-WWW-OpenAI/main/resources/ThreeHunters.jpg';
data-import($imgURL, format => 'md-image').substr(^100)
# 
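The obtained Markdown image string can then be written to a file with the overloaded spurt. A minimal sketch, assuming a plain string argument is exported as text and using a hypothetical file name:

```raku
use Data::Importers;

my $imgURL = 'https://raw.githubusercontent.com/antononcube/Raku-WWW-OpenAI/main/resources/ThreeHunters.jpg';

# Import the image as a Markdown image string (base64 encoded)
my $md = data-import($imgURL, format => 'md-image');

# Write it into a Markdown file for later display (hypothetical file name)
spurt($*CWD ~ '/three-hunters.md', $md);
```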
CSV URL
==> { $_.pick(10).sort({ $_<Package Item> }) }()
==> data-translation(field-names => <Package Item Title Rows Cols>)

| Package | Item | Title | Rows | Cols |
|---|---|---|---|---|
| AER | BenderlyZwick | Benderly and Zwick Data: Inflation, Growth and Stock Returns | 31 | 5 |
| Ecdat | Doctor | Number of Doctor Visits | 485 | 4 |
| Ecdat | StrikeNb | Number of Strikes in Us Manufacturing | 108 | 3 |
| Ecdat | nkill.byCountryYr | Global Terrorism Database yearly summaries | 206 | 46 |
| HSAUR | water | Mortality and Water Hardness | 61 | 4 |
| MASS | SP500 | Returns of the Standard and Poors 500 | 2780 | 1 |
| Stat2Data | Day1Survey | First Day Survey of Statistics Students | 43 | 13 |
| Stat2Data | Putts3 | Hypothetical Putting Data (Short Form) | 5 | 4 |
| asaur | pharmacoSmoking | pharmacoSmoking | 125 | 14 |
| openintro | male_heights | Sample of 100 male heights | 100 | 1 |
PDF URL
Here is an example of importing a PDF file into plain text:
my $txt = slurp('https://pdfobject.com/pdf/sample.pdf', format => 'text');
say text-stats($txt);
# (chars => 2851 words => 416 lines => 38)
Remark: The function text-stats is provided by this package, "Data::Importers".
Here is a sample of the imported text:
$txt.lines[^6].join("\n")
# Sample PDF
# This is a simple PDF file. Fun fun fun.
# Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Phasellus facilisis odio sed mi.
# Curabitur suscipit. Nullam vel nisi. Etiam semper ipsum ut lectus. Proin aliquam, erat eget
# pharetra commodo, eros mi condimentum quam, sed commodo justo quam ut velit.
# Integer a erat. Cras laoreet ligula cursus enim. Aenean scelerisque velit et tellus.
TODO
- DONE Development
  - DONE PDF ingestion
- TODO Export to:
  - DONE JSON files
  - DONE text, Markdown, org, HTML, XML files
  - DONE CSV/TSV files
  - TODO PDF files
  - TODO Image files
- TODO Unit tests
  - TODO PDF ingestion
    - Some initial tests are in place.
References
[AAp1] Anton Antonov,
Image::Markup::Utilities Raku package,
(2023),
GitHub/antononcube.
[HMBp1] H. Merijn Brand,
Text::CSV Raku package,
(2015-2023),
GitHub/Tux.
[SRp1] Steve Roe,
PDF::Extract Raku package,
(2023),
GitHub/librasteve.