glide.extract module

A home for common data extraction nodes

Nodes:

CSVExtract
ExcelExtract
SQLExtract
SQLParamExtract
SQLTableExtract
FileExtract
URLExtract
EmailExtract

class glide.extract.CSVExtract(name, _log=False, _debug=False, **default_context)

Bases: glide.core.Node

Extract data from a CSV

run(f, compression=None, open_flags='r', chunksize=None, nrows=None, reader=csv.DictReader, **kwargs)

Extract data for input file and push dict rows

Parameters

f (file path or buffer) – file path or buffer to read CSV
compression (str, optional) – Parameter passed to pandas' get_filepath_or_buffer
open_flags (str, optional) – Flags to pass to open() if f is not already an opened buffer
chunksize (int, optional) – Read data in chunks of this size
nrows (int, optional) – Limit to reading this number of rows
reader (csv Reader, optional) – The CSV reader class to use. Defaults to csv.DictReader
**kwargs – keyword arguments passed to the reader
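
The interaction between chunksize, nrows, and reader can be sketched with the standard library alone. This is an illustration of the documented parameters, not glide's implementation; iter_csv_chunks is a hypothetical helper, and it assumes that each chunk is delivered as a list of rows.

```python
import csv
import io
from itertools import islice


def iter_csv_chunks(f, chunksize=None, nrows=None, reader=csv.DictReader, **kwargs):
    """Yield lists of rows read from an open text buffer.

    Hypothetical helper that mirrors the documented parameters.
    """
    rows = reader(f, **kwargs)
    if nrows is not None:
        rows = islice(rows, nrows)  # stop after nrows rows
    if chunksize is None:
        yield list(rows)  # a single chunk holding every row
        return
    while True:
        chunk = list(islice(rows, chunksize))
        if not chunk:
            break
        yield chunk


buffer = io.StringIO("id,name\n1,a\n2,b\n3,c\n")
chunks = list(iter_csv_chunks(buffer, chunksize=2))
```

With three data rows and chunksize=2, the sketch produces one chunk of two rows followed by one chunk holding the remaining row.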

class glide.extract.EmailExtract(name, _log=False, _debug=False, **default_context)

Bases: glide.core.Node

Extract data from an email inbox using IMAPClient: https://imapclient.readthedocs.io

run(criteria, sort=None, folder='INBOX', client=None, host=None, username=None, password=None, push_all=False, push_type='message', limit=None, **kwargs)

Extract data from an email inbox and push the data forward.

Note

Instances of IMAPClient are NOT thread safe. They should not be shared and accessed concurrently from multiple threads.

Parameters

criteria (str or list) – Criteria argument passed to IMAPClient.search. See https://tools.ietf.org/html/rfc3501.html#section-6.4.4.
sort (str or list, optional) – Sort criteria passed to IMAPClient.sort. Note that SORT is an extension to the IMAP4 standard so it may not be supported by all IMAP servers. See https://tools.ietf.org/html/rfc5256.
folder (str, optional) – Folder to read emails from
client (optional) – An established IMAPClient connection. If not present, the host/login information is required.
host (str, optional) – The IMAP host to connect to
username (str, optional) – The IMAP username for login
password (str, optional) – The IMAP password for login
push_all (bool, optional) – When true, push all retrieved data/emails at once
push_type (str, optional) –
What type of data to extract and push from the emails. Options include:
- message: push email.message.EmailMessage objects
- message_id: push a list of message IDs that can be fetched
- all: push a list of dict(message=<email.message.EmailMessage>, payload=<extracted payload>)
- body: push a list of email bodies
- attachment: push a list of attachments (an email with multiple attachments will be grouped in a sublist)
limit (int, optional) – Limit to N rows
**kwargs – Keyword arguments to pass to IMAPClient if no client is passed
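
To make the push_type options more concrete, the following sketch shows how a body or a list of attachments could be pulled out of an email.message.EmailMessage with the standard library. It is an assumption about what each option yields, not glide's code; extract_payload is a hypothetical helper, it covers only the message, body, and attachment options, and no IMAP connection is involved.

```python
from email.message import EmailMessage


def extract_payload(message, push_type="message"):
    """Return the part of a message selected by push_type (illustrative only)."""
    if push_type == "message":
        return message
    if push_type == "body":
        body = message.get_body(preferencelist=("plain", "html"))
        return body.get_content() if body is not None else None
    if push_type == "attachment":
        # One list per message, so several attachments stay grouped together.
        return [part.get_content() for part in message.iter_attachments()]
    raise ValueError(f"unsupported push_type: {push_type}")


# Build a small message locally instead of fetching one from a server.
msg = EmailMessage()
msg["Subject"] = "Daily report"
msg.set_content("See attached.")
msg.add_attachment(
    b"a,b\n1,2\n",
    maintype="application",
    subtype="octet-stream",
    filename="report.csv",
)

body_text = extract_payload(msg, "body")
attachments = extract_payload(msg, "attachment")
```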

class glide.extract.ExcelExtract(name, _log=False, _debug=False, **default_context)

Bases: glide.core.Node

Extract data from an Excel file

run(f, dict_rows=False, **kwargs)

Use pyexcel to read data from a file

Parameters

f (str or buffer) – The Excel file to read. Multiple Excel formats are supported.
dict_rows (bool, optional) – If true the rows of each sheet will be converted to dicts with column names as keys.
**kwargs – Keyword arguments passed to pyexcel
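
The effect of dict_rows can be pictured with plain Python lists. The sketch below assumes the first row of a sheet holds the column names; rows_to_dicts is a hypothetical helper and does not use pyexcel or glide.

```python
def rows_to_dicts(rows):
    """Convert a sheet given as a list of rows into dicts keyed by column name.

    Assumes the first row is the header (an assumption for illustration).
    """
    header, *data = rows
    return [dict(zip(header, row)) for row in data]


sheet = [["id", "name"], [1, "a"], [2, "b"]]
records = rows_to_dicts(sheet)
```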

class glide.extract.FileExtract(name, _log=False, _debug=False, **default_context)

Bases: glide.core.Node

Extract raw data from a file

run(f, compression=None, open_flags='r', chunksize=None, push_lines=False, limit=None)

Extract raw data from a file or buffer and push contents

Parameters

f (file path or buffer) – File path or buffer to read
compression (str, optional) – Parameter passed to pandas' get_filepath_or_buffer
open_flags (str, optional) – Flags to pass to open() if f is not already an opened buffer
chunksize (int, optional) – Push lines in chunks of this size
push_lines (bool, optional) – Push each line as it’s read instead of reading entire file and pushing
limit (int, optional) – Limit to first N lines
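
The difference between pushing the whole file and pushing line by line can be sketched as follows. This is a standard-library illustration of push_lines and limit under the assumption that limit applies in both modes; iter_contents is a hypothetical helper, not glide's implementation.

```python
import io
from itertools import islice


def iter_contents(f, push_lines=False, limit=None):
    """Yield either each line or the joined contents of an open text buffer."""
    lines = islice(f, limit)  # islice(f, None) iterates over every line
    if push_lines:
        yield from lines  # one item per line, as it is read
    else:
        yield "".join(lines)  # a single item with the full contents


text = "a\nb\nc\n"
per_line = list(iter_contents(io.StringIO(text), push_lines=True, limit=2))
whole = list(iter_contents(io.StringIO(text)))
```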

class glide.extract.SQLExtract(*args, **kwargs)

Bases: glide.sql.SQLNode

Generic SQL extract Node

run(sql, conn, cursor=None, cursor_type=None, params=None, chunksize=None, **kwargs)

Extract data for input query and push fetched rows.

Parameters

sql (str) – SQL query to run
conn – SQL connection object
cursor (optional) – SQL connection cursor object
cursor_type (optional) – SQL connection cursor type when creating a cursor is necessary
params (tuple or dict, optional) – A tuple or dict of params to pass to the execute method
chunksize (int, optional) – Fetch and push data in chunks of this size
**kwargs – Keyword arguments passed to the execute method
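
A chunked fetch with bound params can be sketched against an in-memory SQLite database. The sketch uses only the DB-API calls execute, fetchall, and fetchmany; fetch_in_chunks and the users table are hypothetical, and the code is an illustration of the documented parameters rather than glide's implementation.

```python
import sqlite3


def fetch_in_chunks(conn, sql, params=(), chunksize=None):
    """Execute a query and yield the fetched rows, optionally in chunks."""
    cursor = conn.cursor()
    cursor.execute(sql, params)
    if chunksize is None:
        yield cursor.fetchall()  # everything in one batch
        return
    while True:
        rows = cursor.fetchmany(chunksize)
        if not rows:
            break
        yield rows


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")])

sql = "SELECT id, name FROM users WHERE id >= ? ORDER BY id"
chunks = list(fetch_in_chunks(conn, sql, params=(1,), chunksize=2))
```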

class glide.extract.SQLParamExtract(*args, **kwargs)

Bases: glide.extract.SQLExtract

Generic SQL extract node that expects SQL params as data instead of a query

run(params, sql, conn, cursor=None, cursor_type=None, chunksize=None, **kwargs)

Extract data for input params and push fetched rows.

Parameters

params (tuple or dict) – A tuple or dict of params to pass to the execute method
sql (str) – SQL query to run
conn – SQL connection object
cursor (optional) – SQL connection cursor object
cursor_type (optional) – SQL connection cursor type when creating a cursor is necessary
chunksize (int, optional) – Fetch and push data in chunks of this size
**kwargs – Keyword arguments passed to the execute method
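
Treating params as the incoming data means one fixed query is executed once per parameter set. The following sketch shows that pattern with sqlite3; the orders table and param_sets are hypothetical, and the loop stands in for whatever feeds params to the node.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, region TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, "east"), (2, "west"), (3, "east")],
)

# The SQL stays fixed; each incoming item supplies the bound parameters.
sql = "SELECT id FROM orders WHERE region = ? ORDER BY id"
param_sets = [("east",), ("west",)]
results = [conn.execute(sql, params).fetchall() for params in param_sets]
```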

class glide.extract.SQLTableExtract(*args, **kwargs)

Bases: glide.sql.SQLNode

Generic SQL table extract node

run(table, conn, cursor=None, cursor_type=None, where=None, limit=None, params=None, chunksize=None, **kwargs)

Extract data for input table and push fetched rows

Parameters

table (str) – SQL table name
conn – SQL connection object
cursor (optional) – SQL connection cursor object
cursor_type (optional) – SQL connection cursor type when creating a cursor is necessary
where (str, optional) – SQL where clause
limit (int, optional) – Limit to put in SQL limit clause
params (tuple or dict, optional) – A tuple or dict of params to pass to the execute method
chunksize (int, optional) – Fetch and push data in chunks of this size
**kwargs – Keyword arguments passed to cursor.execute
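
One way to picture how table, where, and limit combine into a query is the sketch below. It is an assumption for illustration: build_table_query is a hypothetical helper, and the sketch assumes where is given without the WHERE keyword, which the source does not specify. Because a table name cannot be a bound parameter, it is interpolated into the SQL and should come from trusted input.

```python
import sqlite3


def build_table_query(table, where=None, limit=None):
    """Assemble a SELECT statement from the documented arguments (illustrative)."""
    parts = [f"SELECT * FROM {table}"]  # table name must be trusted input
    if where is not None:
        parts.append(f"WHERE {where}")  # assumes where omits the keyword
    if limit is not None:
        parts.append(f"LIMIT {int(limit)}")
    return " ".join(parts)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")])

query = build_table_query("users", where="id > ?", limit=1)
rows = conn.execute(query, (1,)).fetchall()
```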

class glide.extract.URLExtract(name, _log=False, _debug=False, **default_context)

Bases: glide.core.Node

Extract data from a URL with requests

run(request, data_type='content', session=None, skip_raise=False, handle_paging=None, page_limit=None, push_pages=False, **kwargs)

Extract data from a URL using requests and push the response data (response.content by default; see data_type). The input request may be a string (a URL to GET) or a dictionary of args to requests.request:

http://2.python-requests.org/en/master/api/?highlight=get#requests.request

See the requests docs for information on authentication options:

https://requests.kennethreitz.org/en/master/user/authentication/

Parameters

request (str or dict) – If str, a URL to GET. If a dict, args to requests.request
data_type (str, optional) – One of “content”, “text”, or “json” to control extraction of data from requests response.
session (optional) – A requests Session to use to make the request
skip_raise (bool, optional) – If False, raise exceptions for bad response status
handle_paging (callable, optional) –
A callable that accepts the following params and updates the args that will be passed to requests.request in place. The callable should return two values, the page data extracted from the API response and a flag denoting whether the last page has been reached. Arguments:
- result: the API result of the most recent request
- request: a request args dict to update
page_limit (int, optional) – If passed, a cap on the number of pages pulled
push_pages (bool, optional) – If true, push each page individually.
**kwargs – Keyword arguments to pass to the request method. If a dict is passed for the request parameter, its values override these.
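
The handle_paging contract (update the request args in place, return the page data and a last-page flag) can be sketched without any network access. Everything specific here is an assumption: the API shape with items and next_page keys, the fake_api stand-in, the example URL, and the idea that result arrives as parsed JSON. The driving loop is a simplified stand-in for the node, not glide's implementation.

```python
def handle_paging(result, request):
    """Hypothetical paging callable for an API that returns 'items' and 'next_page'."""
    next_page = result.get("next_page")
    if next_page is not None:
        # Update the request args in place so the next call fetches the next page.
        request.setdefault("params", {})["page"] = next_page
    return result["items"], next_page is None


def fake_api(request):
    """Stand-in for requests.request(**request).json() on a two-page API."""
    pages = {
        1: {"items": [1, 2], "next_page": 2},
        2: {"items": [3], "next_page": None},
    }
    return pages[request["params"]["page"]]


request = {"method": "GET", "url": "https://api.example.com/items", "params": {"page": 1}}
collected = []
page_limit = 10  # cap on the number of pages pulled

for _ in range(page_limit):
    data, is_last = handle_paging(fake_api(request), request)
    collected.extend(data)
    if is_last:
        break
```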
