glide.extract module

A home for common data extraction nodes

Nodes:

  • CSVExtract

  • ExcelExtract

  • SQLExtract

  • SQLParamExtract

  • SQLTableExtract

  • FileExtract

  • URLExtract

  • EmailExtract

class glide.extract.CSVExtract(name, _log=False, _debug=False, **default_context)[source]

Bases: glide.core.Node

Extract data from a CSV

run(f, compression=None, open_flags='r', chunksize=None, nrows=None, reader=<class 'csv.DictReader'>, **kwargs)[source]

Extract data from the input file and push dict rows

Parameters
  • f (file path or buffer) – file path or buffer to read CSV

  • compression (str, optional) – Compression parameter passed to pandas get_filepath_or_buffer

  • open_flags (str, optional) – Flags to pass to open() if f is not already an opened buffer

  • chunksize (int, optional) – Read data in chunks of this size

  • nrows (int, optional) – Limit to reading this number of rows

  • reader (csv Reader, optional) – The CSV reader class to use. Defaults to csv.DictReader

  • **kwargs – keyword arguments passed to the reader
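The chunksize, nrows, and reader parameters can be illustrated with the standard library alone. This is a minimal sketch, not glide's implementation: the helper name read_csv_rows is hypothetical, and it assumes that chunks are delivered as lists of rows.

```python
import csv
import io
from itertools import islice


def read_csv_rows(f, chunksize=None, nrows=None, reader=csv.DictReader, **kwargs):
    """Yield rows (or lists of rows when chunksize is set) from an open buffer.

    Hypothetical helper that mirrors the documented parameters.
    """
    rows = reader(f, **kwargs)
    if nrows is not None:
        rows = islice(rows, nrows)  # stop after nrows rows
    if chunksize is None:
        yield from rows
        return
    while True:
        chunk = list(islice(rows, chunksize))
        if not chunk:
            break
        yield chunk


sample = io.StringIO("id,name\n1,ann\n2,bob\n3,cy\n")
chunks = list(read_csv_rows(sample, chunksize=2))
```

With chunksize=2, the three data rows arrive as one chunk of two dicts followed by one chunk of a single dict.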

class glide.extract.EmailExtract(name, _log=False, _debug=False, **default_context)[source]

Bases: glide.core.Node

Extract data from an email inbox using IMAPClient: https://imapclient.readthedocs.io

run(criteria, sort=None, folder='INBOX', client=None, host=None, username=None, password=None, push_all=False, push_type='message', limit=None, **kwargs)[source]

Extract data from an email inbox and push the data forward.

Note

Instances of IMAPClient are NOT thread safe. They should not be shared and accessed concurrently from multiple threads.

Parameters
  • criteria (str or list) – Criteria argument passed to IMAPClient.search. See https://tools.ietf.org/html/rfc3501.html#section-6.4.4.

  • sort (str or list, optional) – Sort criteria passed to IMAPClient.sort. Note that SORT is an extension to the IMAP4 standard so it may not be supported by all IMAP servers. See https://tools.ietf.org/html/rfc5256.

  • folder (str, optional) – Folder to read emails from

  • client (optional) – An established IMAPClient connection. If not present, the host/login information is required.

  • host (str, optional) – The IMAP host to connect to

  • username (str, optional) – The IMAP username for login

  • password (str, optional) – The IMAP password for login

  • push_all (bool, optional) – When true, push all retrieved data/emails at once

  • push_type (str, optional) –

    What type of data to extract and push from the emails. Options include:

    • message: push email.message.EmailMessage objects

    • message_id: push a list of message IDs that can be fetched

    • all: push a list of dict(message=<email.message.EmailMessage>, payload=<extracted payload>)

    • body: push a list of email bodies

    • attachment: push a list of attachments (an email with multiple attachments will be grouped in a sublist)

  • limit (int, optional) – Limit to N rows

  • **kwargs – Keyword arguments to pass to IMAPClient if no client is passed
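To make the push_type options more concrete, the sketch below uses only the standard library email package to show what a message, its body, and its attachments look like once raw bytes have been parsed. It does not connect to an IMAP server and is not glide's implementation. The sample message and addresses are invented for illustration, and the exact payload shapes that EmailExtract pushes are an assumption.

```python
import email
from email import policy
from email.message import EmailMessage

# Build a sample message so the example needs no mail server.
original = EmailMessage()
original["Subject"] = "Report"
original["From"] = "sender@example.com"
original["To"] = "receiver@example.com"
original.set_content("See attached.")
original.add_attachment(
    b"id,name\n1,ann\n",
    maintype="application",
    subtype="octet-stream",
    filename="report.csv",
)

# Raw bytes are what an IMAP fetch would typically hand back.
raw = original.as_bytes()

# Roughly the "message" option: a parsed EmailMessage object.
message = email.message_from_bytes(raw, policy=policy.default)

# Roughly the "body" option: the plain-text body.
body = message.get_body(preferencelist=("plain",)).get_content()

# Roughly the "attachment" option: attachments grouped per email.
attachments = [
    (part.get_filename(), part.get_content())
    for part in message.iter_attachments()
]
```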

class glide.extract.ExcelExtract(name, _log=False, _debug=False, **default_context)[source]

Bases: glide.core.Node

Extract data from an Excel file

run(f, dict_rows=False, **kwargs)[source]

Use pyexcel to read data from a file

Parameters
  • f (str or buffer) – The Excel file to read. Multiple Excel formats are supported.

  • dict_rows (bool, optional) – If true, the rows of each sheet will be converted to dicts with column names as keys.

  • **kwargs – Keyword arguments passed to pyexcel
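The effect of dict_rows can be sketched in plain Python without pyexcel. The helper rows_to_dicts is hypothetical, and the sketch assumes that the first row of a sheet holds the column names.

```python
def rows_to_dicts(rows):
    """Convert a sheet given as a list of lists into a list of dicts.

    Hypothetical helper; assumes the first row holds the column names.
    """
    header, *data = rows
    return [dict(zip(header, row)) for row in data]


sheet = [["id", "name"], [1, "ann"], [2, "bob"]]
records = rows_to_dicts(sheet)
```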

class glide.extract.FileExtract(name, _log=False, _debug=False, **default_context)[source]

Bases: glide.core.Node

Extract raw data from a file

run(f, compression=None, open_flags='r', chunksize=None, push_lines=False, limit=None)[source]

Extract raw data from a file or buffer and push contents

Parameters
  • f (file path or buffer) – File path or buffer to read

  • compression (str, optional) – Compression parameter passed to pandas get_filepath_or_buffer

  • open_flags (str, optional) – Flags to pass to open() if f is not already an opened buffer

  • chunksize (int, optional) – Push lines in chunks of this size

  • push_lines (bool, optional) – Push each line as it’s read instead of reading entire file and pushing

  • limit (int, optional) – Limit to first N lines
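The difference between pushing the whole file and pushing line by line can be sketched as follows. This is an illustration only, not glide's code: iter_file is a hypothetical helper, and applying limit in both modes is an assumption.

```python
import io
from itertools import islice


def iter_file(f, push_lines=False, limit=None):
    """Yield file contents either line by line or as one string.

    Hypothetical helper; limit caps the number of lines read in both modes.
    """
    lines = islice(f, limit)  # limit=None reads every line
    if push_lines:
        yield from lines
    else:
        yield "".join(lines)


by_line = list(iter_file(io.StringIO("a\nb\nc\n"), push_lines=True, limit=2))
whole = list(iter_file(io.StringIO("a\nb\nc\n")))
```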

class glide.extract.SQLExtract(*args, **kwargs)[source]

Bases: glide.sql.SQLNode

Generic SQL extract Node

run(sql, conn, cursor=None, cursor_type=None, params=None, chunksize=None, **kwargs)[source]

Extract data for input query and push fetched rows.

Parameters
  • sql (str) – SQL query to run

  • conn – SQL connection object

  • cursor (optional) – SQL connection cursor object

  • cursor_type (optional) – SQL connection cursor type when creating a cursor is necessary

  • params (tuple or dict, optional) – A tuple or dict of params to pass to the execute method

  • chunksize (int, optional) – Fetch and push data in chunks of this size

  • **kwargs – Keyword arguments passed to the execute method
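A query with params and chunksize can be illustrated against an in-memory SQLite database. This is a sketch of the general DB-API pattern (execute, then fetchmany in a loop), not glide's implementation. The helper fetch_rows and the users table are hypothetical.

```python
import sqlite3


def fetch_rows(sql, conn, params=None, chunksize=None):
    """Yield all rows at once, or lists of at most chunksize rows.

    Hypothetical helper that mirrors the documented parameters.
    """
    cursor = conn.cursor()
    cursor.execute(sql, params or ())
    if chunksize is None:
        yield cursor.fetchall()
        return
    while True:
        rows = cursor.fetchmany(chunksize)
        if not rows:
            break
        yield rows


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)", [(1, "ann"), (2, "bob"), (3, "cy")]
)
chunks = list(
    fetch_rows(
        "SELECT id, name FROM users WHERE id > ? ORDER BY id",
        conn,
        params=(0,),
        chunksize=2,
    )
)
```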

class glide.extract.SQLParamExtract(*args, **kwargs)[source]

Bases: glide.extract.SQLExtract

Generic SQL extract node that expects SQL params as data instead of a query

run(params, sql, conn, cursor=None, cursor_type=None, chunksize=None, **kwargs)[source]

Extract data for input params and push fetched rows.

Parameters
  • params (tuple or dict) – A tuple or dict of params to pass to the execute method

  • sql (str) – SQL query to run

  • conn – SQL connection object

  • cursor (optional) – SQL connection cursor object

  • cursor_type (optional) – SQL connection cursor type when creating a cursor is necessary

  • chunksize (int, optional) – Fetch and push data in chunks of this size

  • **kwargs – Keyword arguments passed to the execute method
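The inversion described here, where the query is fixed and the params arrive as data, can be sketched with sqlite3. The table, the query, and the param_stream list are invented for illustration; in a pipeline the params would come from an upstream node.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)", [(1, "ann"), (2, "bob"), (3, "cy")]
)

# The query stays fixed; each incoming item supplies its params.
sql = "SELECT name FROM users WHERE id = ?"
param_stream = [(1,), (3,)]

results = [conn.execute(sql, params).fetchall() for params in param_stream]
```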

class glide.extract.SQLTableExtract(*args, **kwargs)[source]

Bases: glide.sql.SQLNode

Generic SQL table extract node

run(table, conn, cursor=None, cursor_type=None, where=None, limit=None, params=None, chunksize=None, **kwargs)[source]

Extract data for input table and push fetched rows

Parameters
  • table (str) – SQL table name

  • conn – SQL connection object

  • cursor (optional) – SQL connection cursor object

  • cursor_type (optional) – SQL connection cursor type when creating a cursor is necessary

  • where (str, optional) – SQL where clause

  • limit (int, optional) – Limit to put in SQL limit clause

  • params (tuple or dict, optional) – A tuple or dict of params to pass to the execute method

  • chunksize (int, optional) – Fetch and push data in chunks of this size

  • **kwargs – Keyword arguments passed to cursor.execute
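One way the table, where, and limit arguments could combine into a query is sketched below. The helper build_table_query is hypothetical and is not glide's query builder. In particular, it assumes that where carries the full clause including the WHERE keyword. The table name is interpolated directly into the SQL string, so it should only come from trusted input.

```python
import sqlite3


def build_table_query(table, where=None, limit=None):
    """Build a SELECT statement for a table.

    Hypothetical helper; assumes `where` includes the WHERE keyword.
    """
    sql = f"SELECT * FROM {table}"  # table name must be trusted
    if where:
        sql += f" {where}"
    if limit is not None:
        sql += f" LIMIT {int(limit)}"
    return sql


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)", [(1, "ann"), (2, "bob"), (3, "cy")]
)

sql = build_table_query("users", where="WHERE id > ?", limit=1)
rows = conn.execute(sql, (1,)).fetchall()
```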

class glide.extract.URLExtract(name, _log=False, _debug=False, **default_context)[source]

Bases: glide.core.Node

Extract data from a URL with requests

run(request, data_type='content', session=None, skip_raise=False, handle_paging=None, page_limit=None, push_pages=False, **kwargs)[source]

Extract data from a URL using requests and push the response data (response.content by default; see data_type). The input request may be a string (a URL to GET) or a dictionary of args to requests.request:

http://2.python-requests.org/en/master/api/?highlight=get#requests.request

See the requests docs for information on authentication options:

https://requests.kennethreitz.org/en/master/user/authentication/

Parameters
  • request (str or dict) – If str, a URL to GET. If a dict, args to requests.request

  • data_type (str, optional) – One of “content”, “text”, or “json” to control extraction of data from requests response.

  • session (optional) – A requests Session to use to make the request

  • skip_raise (bool, optional) – If False, raise exceptions for bad response status

  • handle_paging (callable, optional) –

    A callable that takes the arguments listed below and updates, in place, the args that will be passed to requests.request. It should return two values: the page data extracted from the API response, and a flag denoting whether the last page has been reached. Arguments:

    • result: the API result of the most recent request

    • request: a request args dict to update

  • page_limit (int, optional) – If passed, use as a cap of the number of pages pulled

  • push_pages (bool, optional) – If true, push each page individually.

  • **kwargs – Keyword arguments to pass to the request method. If a dict is passed for the request parameter, its values override these keyword arguments.
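The handle_paging contract can be sketched without any network access. In the example below, handle_paging follows the documented shape: it receives the result and the request dict, updates the request in place, and returns the page data plus a last-page flag. Everything else is hypothetical scaffolding: FAKE_PAGES and fake_request stand in for a real API, pull_all stands in for the node's paging loop, and the items and next_page keys are invented. It also assumes that result is already-decoded JSON.

```python
def handle_paging(result, request):
    """Return (page_data, is_last_page) and advance the request in place."""
    data = result["items"]
    next_page = result.get("next_page")
    if next_page is None:
        return data, True
    request.setdefault("params", {})["page"] = next_page
    return data, False


# Stand-in for a paged API, so the example makes no real HTTP calls.
FAKE_PAGES = {
    1: {"items": [1, 2], "next_page": 2},
    2: {"items": [3], "next_page": None},
}


def fake_request(request):
    return FAKE_PAGES[request["params"]["page"]]


def pull_all(request, handle_paging, page_limit=None):
    """Hypothetical paging loop: fetch pages until the last one or the cap."""
    pages = []
    while True:
        result = fake_request(request)
        data, is_last = handle_paging(result, request)
        pages.append(data)
        if is_last or (page_limit is not None and len(pages) >= page_limit):
            break
    return pages


request = {
    "method": "GET",
    "url": "https://api.example.com/items",
    "params": {"page": 1},
}
pages = pull_all(request, handle_paging)
```

Because the callable mutates the request dict, a fresh dict is needed for each independent run of the loop.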