Data::Importers

In brief

This repository provides a Raku package for importing and exporting different types of data from both URLs and files. The data format is automatically deduced from file extensions.

Remark: The built-in subs slurp and spurt are overloaded by definitions of this package. The corresponding functions data-import and data-export can also be used.

The format of the data from URLs or files can be specified with the named argument "format". If format => Whatever, then the format is deduced from the extension of the given URL or file name.

(Currently) the recognized formats are: CSV, HTML, JSON, Image (png, jpeg, jpg), PDF, Plaintext, Text, XML.

The subs slurp and data-import work with the formats above; CSV files require "Text::CSV", [HMBp1], to be installed, and PDF files require "PDF::Extract", [SRp1].

The subs spurt and data-export can work with CSV & TSV files if "Text::CSV", [HMBp1], is installed.

Remark: Since "Text::CSV" is a "heavy" package to install, it is not included in the dependencies of this one.

Remark: Similarly, "PDF::Extract", [SRp1], requires an additional, non-Raku installation and currently targets only macOS. That is why it is not included in the dependencies of "Data::Importers".
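Here is an illustrative sketch of a CSV export. It assumes "Text::CSV" is installed and that data-export follows the calling convention of the built-in spurt (destination first, data second); the file name and data are hypothetical, and the exact accepted signatures and data shapes may differ:

use Data::Importers;

# Tabular data as an array of hashes (assumed input shape)
my @tbl = { name => 'ingrid', value => 1 }, { name => 'olaf', value => 2 };

# Destination first, data second -- assumed to mirror the built-in spurt
data-export('example.csv', @tbl);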


Installation

From the Zef ecosystem:

zef install Data::Importers

From GitHub:

zef install https://github.com/antononcube/Raku-Data-Importers.git
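The optional dependencies mentioned above are not installed automatically. If CSV/TSV or PDF support is needed, they can be added separately (note that "PDF::Extract" also has non-Raku prerequisites):

zef install Text::CSV

zef install PDF::Extract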

File examples

In order to use the slurp definitions of this package, the named argument "format" has to be specified:

JSON file

use Data::Importers;

slurp($*CWD ~ '/resources/simple.json', format => 'json')
# {name => ingrid, value => 1}

Instead of slurp, the function data-import can be used (no need to specify "format"):

data-import($*CWD ~ '/resources/simple.json')
# {name => ingrid, value => 1}
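The imported JSON is an ordinary Raku Hash, so its fields can be accessed directly. For example, using the sample file above:

my %record = data-import($*CWD ~ '/resources/simple.json');

say %record<name>;
# ingrid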

CSV file

slurp($*CWD ~ '/resources/simple.csv', format => 'csv', headers => 'auto')
# [{X1 => 1, X2 => A, X3 => Cold} {X1 => 2, X2 => B, X3 => Warm} {X1 => 3, X2 => C, X3 => Hot}]
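The CSV import produces a list of hashes, one per record, keyed by the (auto-generated) header names, so individual columns are easy to extract. For example, using the sample file above:

my @records = slurp($*CWD ~ '/resources/simple.csv', format => 'csv', headers => 'auto');

say @records.map({ $_<X3> });
# (Cold Warm Hot)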

URL examples

JSON URLs

Import a JSON file:

my $url = 'https://raw.githubusercontent.com/antononcube/Raku-LLM-Prompts/main/resources/prompt-stencil.json';

my $res = data-import($url, format => Whatever);

$res.WHAT;
# (Hash)

Here is the deduced type:

use Data::TypeSystem;

deduce-type($res);
# Struct([Arity, Categories, ContributedBy, Description, Keywords, Name, NamedArguments, PositionalArguments, PromptText, Topics, URL], [Int, Hash, Str, Str, Array, Str, Array, Hash, Str, Hash, Str])

Using slurp instead of data-import:

slurp($url)
# {Arity => 1, Categories => {Function Prompts => False, Modifier Prompts => False, Personas => False}, ContributedBy => Anton Antonov, Description => Write me!, Keywords => [], Name => Write me!, NamedArguments => [], PositionalArguments => {$a => VAL}, PromptText => -> $a='VAL' {"Something over $a."}, Topics => {AI Guidance => False, Advisor Bots => False, Character Types => False, Chats => False, Computable Output => False, Content Derived from Text => False, Education => False, Entertainment => False, Fictional Characters => False, For Fun => False, General Text Manipulation => False, Historical Figures => False, Linguistics => False, Output Formatting => False, Personalization => False, Prompt Engineering => False, Purpose Based => False, Real-World Actions => False, Roles => False, Special-Purpose Text Manipulation => False, Text Analysis => False, Text Generation => False, Text Styling => False, Wolfram Language => False, Writers => False, Writing Genres => False}, URL => None}

Image URL

Import an image:

my $imgURL = 'https://raw.githubusercontent.com/antononcube/Raku-WWW-OpenAI/main/resources/ThreeHunters.jpg';

data-import($imgURL, format => 'md-image').substr(^100)
# ![](data:image/jpeg;base64,/9j/4AAQSkZJRgABAQEASABIAAD/2wBDAAUEBAUEAwUFBAUGBgUGCA4JCAcHCBEMDQoOFBEVF

Remark: Image ingestion is delegated to "Image::Markup::Utilities", [AAp1]. The format value 'md-image' can be used to display images in Markdown files or Jupyter notebooks.
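For illustration, the returned Markdown image string can be written into a Markdown document (the file name below is hypothetical); the IO method form of spurt is used here:

my $md = data-import($imgURL, format => 'md-image');

'ThreeHunters.md'.IO.spurt($md);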

CSV URL

Here we ingest a CSV file and show a table of a 10-row sample:

use Data::Translators;

'https://raw.githubusercontent.com/antononcube/Raku-Data-ExampleDatasets/main/resources/dfRdatasets.csv'
==> slurp(headers => 'auto') 
==> { $_.pick(10).sort({ $_<Package Item> }) }()
==> data-translation(field-names => <Package Item Title Rows Cols>)
| Package   | Item              | Title                                                         | Rows | Cols |
|-----------|-------------------|---------------------------------------------------------------|------|------|
| AER       | BenderlyZwick     | Benderly and Zwick Data: Inflation, Growth and Stock Returns  | 31   | 5    |
| Ecdat     | Doctor            | Number of Doctor Visits                                       | 485  | 4    |
| Ecdat     | StrikeNb          | Number of Strikes in Us Manufacturing                         | 108  | 3    |
| Ecdat     | nkill.byCountryYr | Global Terrorism Database yearly summaries                    | 206  | 46   |
| HSAUR     | water             | Mortality and Water Hardness                                  | 61   | 4    |
| MASS      | SP500             | Returns of the Standard and Poors 500                         | 2780 | 1    |
| Stat2Data | Day1Survey        | First Day Survey of Statistics Students                       | 43   | 13   |
| Stat2Data | Putts3            | Hypothetical Putting Data (Short Form)                        | 5    | 4    |
| asaur     | pharmacoSmoking   | pharmacoSmoking                                               | 125  | 14   |
| openintro | male_heights      | Sample of 100 male heights                                    | 100  | 1    |

PDF URL

Here is an example of importing a PDF file into plain text:

my $txt = slurp('https://pdfobject.com/pdf/sample.pdf', format=>'text');

say text-stats($txt);
# (chars => 2851 words => 416 lines => 38)

Remark: The function text-stats is provided by this package, "Data::Importers".

Here is a sample of the imported text:

$txt.lines[^6].join("\n")
# Sample PDF
# This is a simple PDF file. Fun fun fun.
# Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Phasellus facilisis odio sed mi.
# Curabitur suscipit. Nullam vel nisi. Etiam semper ipsum ut lectus. Proin aliquam, erat eget
# pharetra commodo, eros mi condimentum quam, sed commodo justo quam ut velit.
# Integer a erat. Cras laoreet ligula cursus enim. Aenean scelerisque velit et tellus.
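The imported PDF text is a plain Str, so it can be processed further with ordinary Raku means. For instance, here is a quick word-frequency tally (illustration only; output not shown):

# Three most frequent (lowercased) words in the imported text
$txt.words.map(*.lc).Bag.sort(-*.value).head(3)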

TODO


References

[AAp1] Anton Antonov, Image::Markup::Utilities Raku package, (2023), GitHub/antononcube.

[HMBp1] H. Merijn Brand, Text::CSV Raku package, (2015-2023), GitHub/Tux.

[SRp1] Steve Roe, PDF::Extract Raku package, (2023), GitHub/librasteve.