glide.extract module
A home for common data extraction nodes
Nodes:
CSVExtract
ExcelExtract
SQLExtract
SQLParamExtract
SQLTableExtract
FileExtract
URLExtract
EmailExtract
-
class glide.extract.CSVExtract(name, _log=False, _debug=False, **default_context)
Bases: glide.core.Node
Extract data from a CSV
-
run(f, compression=None, open_flags='r', chunksize=None, nrows=None, reader=<class 'csv.DictReader'>, **kwargs)
Extract data for input file and push dict rows
- Parameters
f (file path or buffer) – file path or buffer to read CSV
compression (str, optional) – Compression parameter passed to pandas get_filepath_or_buffer
open_flags (str, optional) – Flags to pass to open() if f is not already an opened buffer
chunksize (int, optional) – Read data in chunks of this size
nrows (int, optional) – Limit to reading this number of rows
reader (csv Reader, optional) – The CSV reader class to use. Defaults to csv.DictReader
**kwargs – keyword arguments passed to the reader
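The interaction of chunksize, nrows, and reader can be approximated with the standard library alone. This is a minimal sketch, not glide's implementation: the helper name read_csv_chunks is hypothetical, and it assumes that chunksize groups rows into lists while nrows caps the total number of rows read.

```python
import csv
import io
from itertools import islice


def read_csv_chunks(f, chunksize=None, nrows=None, reader=csv.DictReader, **kwargs):
    """Yield lists of rows from an open text buffer (hypothetical helper)."""
    rows = reader(f, **kwargs)
    if nrows is not None:
        rows = islice(rows, nrows)  # stop after nrows rows in total
    if chunksize is None:
        yield list(rows)  # a single chunk holding every row
        return
    while True:
        chunk = list(islice(rows, chunksize))
        if not chunk:
            break
        yield chunk


sample = io.StringIO("id,name\n1,a\n2,b\n3,c\n4,d\n5,e\n")
chunks = list(read_csv_chunks(sample, chunksize=2, nrows=5))
```

With csv.DictReader every row is a mapping keyed by the header, so a downstream consumer can read fields such as row["name"] without tracking column positions.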
-
-
class glide.extract.EmailExtract(name, _log=False, _debug=False, **default_context)
Bases: glide.core.Node
Extract data from an email inbox using IMAPClient: https://imapclient.readthedocs.io
-
run(criteria, sort=None, folder='INBOX', client=None, host=None, username=None, password=None, push_all=False, push_type='message', limit=None, **kwargs)
Extract data from an email inbox and push the data forward.
Note
Instances of IMAPClient are NOT thread safe. They should not be shared and accessed concurrently from multiple threads.
- Parameters
criteria (str or list) – Criteria argument passed to IMAPClient.search. See https://tools.ietf.org/html/rfc3501.html#section-6.4.4.
sort (str or list, optional) – Sort criteria passed to IMAPClient.sort. Note that SORT is an extension to the IMAP4 standard so it may not be supported by all IMAP servers. See https://tools.ietf.org/html/rfc5256.
folder (str, optional) – Folder to read emails from
client (optional) – An established IMAPClient connection. If not present, the host/login information is required.
host (str, optional) – The IMAP host to connect to
username (str, optional) – The IMAP username for login
password (str, optional) – The IMAP password for login
push_all (bool, optional) – When true, push all retrieved data/emails at once
push_type (str, optional) –
What type of data to extract and push from the emails. Options include:
message: push email.message.EmailMessage objects
message_id: push a list of message IDs that can be fetched
all: push a list of dict(message=<email.message.EmailMessage>, payload=<extracted payload>)
body: push a list of email bodies
attachment: push a list of attachments (an email with multiple attachments will be grouped in a sublist)
limit (int, optional) – Limit to N rows
**kwargs – Keyword arguments to pass to IMAPClient if no client is passed
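To illustrate what the push_type options refer to, the sketch below pulls the body and attachments out of an email.message.EmailMessage using only the standard library. It is an assumption about the kind of extraction involved, not glide's code: the helper extract_payload is hypothetical, it covers only three of the options, and it builds a message locally instead of fetching one over IMAP.

```python
from email.message import EmailMessage


def extract_payload(message, push_type="message"):
    """Return the part of an EmailMessage named by push_type (hypothetical helper)."""
    if push_type == "message":
        return message
    if push_type == "body":
        # Prefer the plain-text body, falling back to HTML.
        body = message.get_body(preferencelist=("plain", "html"))
        return body.get_content() if body is not None else None
    if push_type == "attachment":
        # All attachments of one email are grouped in a single list.
        return [part.get_content() for part in message.iter_attachments()]
    raise ValueError("unsupported push_type: %s" % push_type)


msg = EmailMessage()
msg["Subject"] = "Daily report"
msg.set_content("See attached.")
msg.add_attachment(
    b"a,b\n1,2\n",
    maintype="application",
    subtype="octet-stream",
    filename="report.csv",
)

body = extract_payload(msg, "body")
attachments = extract_payload(msg, "attachment")
```

When parsing raw bytes yourself, email.message_from_bytes returns EmailMessage objects only if it is called with policy=email.policy.default; the legacy default policy yields email.message.Message, which lacks get_body and iter_attachments.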
-
-
class glide.extract.ExcelExtract(name, _log=False, _debug=False, **default_context)
Bases: glide.core.Node
Extract data from an Excel file
-
run(f, dict_rows=False, **kwargs)
Use pyexcel to read data from a file
- Parameters
f (str or buffer) – The Excel file to read. Multiple Excel formats are supported.
dict_rows (bool, optional) – If true the rows of each sheet will be converted to dicts with column names as keys.
**kwargs – Keyword arguments passed to pyexcel
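The effect of dict_rows can be pictured with plain Python lists. This sketch assumes that the first row of a sheet holds the column names, which the description above does not state explicitly; rows_to_dicts is a hypothetical helper and does not call pyexcel.

```python
def rows_to_dicts(rows):
    """Map each data row to a dict keyed by the first (header) row."""
    if not rows:
        return []
    header, *data = rows
    return [dict(zip(header, row)) for row in data]


sheet = [["id", "name"], [1, "a"], [2, "b"]]
records = rows_to_dicts(sheet)
```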
-
-
class glide.extract.FileExtract(name, _log=False, _debug=False, **default_context)
Bases: glide.core.Node
Extract raw data from a file
-
run(f, compression=None, open_flags='r', chunksize=None, push_lines=False, limit=None)
Extract raw data from a file or buffer and push contents
- Parameters
f (file path or buffer) – File path or buffer to read
compression (str, optional) – Compression parameter passed to pandas get_filepath_or_buffer
open_flags (str, optional) – Flags to pass to open() if f is not already an opened buffer
chunksize (int, optional) – Push lines in chunks of this size
push_lines (bool, optional) – Push each line as it’s read instead of reading entire file and pushing
limit (int, optional) – Limit to first N lines
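The three pushing modes implied by push_lines, chunksize, and limit can be sketched as a generator over an open text buffer. How glide combines these options when more than one is set is not documented here, so the precedence below (push_lines first, then chunksize) is an assumption, and read_file is a hypothetical helper.

```python
import io
from itertools import islice


def read_file(f, chunksize=None, push_lines=False, limit=None):
    """Yield what would be pushed downstream (hypothetical helper)."""
    lines = iter(f) if limit is None else islice(f, limit)
    if push_lines:
        yield from lines  # one push per line
    elif chunksize:
        while True:
            chunk = list(islice(lines, chunksize))
            if not chunk:
                break
            yield chunk  # one push per list of lines
    else:
        yield "".join(lines)  # a single push with the whole content


text = "a\nb\nc\nd\n"
per_line = list(read_file(io.StringIO(text), push_lines=True, limit=3))
chunked = list(read_file(io.StringIO(text), chunksize=3))
whole = list(read_file(io.StringIO(text)))
```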
-
-
class glide.extract.SQLExtract(*args, **kwargs)
Bases: glide.sql.SQLNode
Generic SQL extract Node
-
run(sql, conn, cursor=None, cursor_type=None, params=None, chunksize=None, **kwargs)
Extract data for input query and push fetched rows.
- Parameters
sql (str) – SQL query to run
conn – SQL connection object
cursor (optional) – SQL connection cursor object
cursor_type (optional) – SQL connection cursor type when creating a cursor is necessary
params (tuple or dict, optional) – A tuple or dict of params to pass to the execute method
chunksize (int, optional) – Fetch and push data in chunks of this size
**kwargs – Keyword arguments passed to the execute method
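Fetching in chunks with params maps naturally onto the DB-API cursor methods execute and fetchmany. The sketch below uses sqlite3 so that it runs without an external database; fetch_chunks is a hypothetical helper, and the assumption is that each chunk is pushed as a list of rows.

```python
import sqlite3


def fetch_chunks(conn, sql, params=(), chunksize=None):
    """Yield lists of rows for a query (hypothetical helper)."""
    cursor = conn.cursor()
    cursor.execute(sql, params)
    if chunksize is None:
        yield cursor.fetchall()  # everything in one chunk
        return
    while True:
        rows = cursor.fetchmany(chunksize)
        if not rows:
            break
        yield rows


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?)",
    [(i, "item%d" % i) for i in range(5)],
)

chunks = list(
    fetch_chunks(
        conn,
        "SELECT id, name FROM items WHERE id >= ? ORDER BY id",
        params=(1,),
        chunksize=3,
    )
)
```

Passing values through params instead of formatting them into the SQL string lets the driver handle quoting.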
-
-
class glide.extract.SQLParamExtract(*args, **kwargs)
Bases: glide.extract.SQLExtract
Generic SQL extract node that expects SQL params as data instead of a query
-
run(params, sql, conn, cursor=None, cursor_type=None, chunksize=None, **kwargs)
Extract data for input params and push fetched rows.
- Parameters
params (tuple or dict) – A tuple or dict of params to pass to the execute method
sql (str) – SQL query to run
conn – SQL connection object
cursor (optional) – SQL connection cursor object
cursor_type (optional) – SQL connection cursor type when creating a cursor is necessary
chunksize (int, optional) – Fetch and push data in chunks of this size
**kwargs – Keyword arguments passed to the execute method
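Because params comes first in this signature, the query stays fixed while the parameters arrive as data. A minimal sketch of that pattern with sqlite3, assuming each incoming item is a dict of named parameters (the table and values are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, country TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(1, "US"), (2, "DE"), (3, "US")],
)

# The query is fixed; only the parameters change between runs.
sql = "SELECT id FROM users WHERE country = :country ORDER BY id"
incoming_params = [{"country": "US"}, {"country": "DE"}]

results = [conn.execute(sql, params).fetchall() for params in incoming_params]
```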
-
-
class glide.extract.SQLTableExtract(*args, **kwargs)
Bases: glide.sql.SQLNode
Generic SQL table extract node
-
run(table, conn, cursor=None, cursor_type=None, where=None, limit=None, params=None, chunksize=None, **kwargs)
Extract data for input table and push fetched rows
- Parameters
table (str) – SQL table name
conn – SQL connection object
cursor (optional) – SQL connection cursor object
cursor_type (optional) – SQL connection cursor type when creating a cursor is necessary
where (str, optional) – SQL where clause
limit (int, optional) – Limit to put in SQL limit clause
params (tuple or dict, optional) – A tuple or dict of params to pass to the execute method
chunksize (int, optional) – Fetch and push data in chunks of this size
**kwargs – Keyword arguments passed to cursor.execute
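The exact SQL that glide generates from table, where, and limit is not shown in this reference. The sketch below is one plausible construction: build_table_query is a hypothetical helper, and it assumes that where holds only the condition text, without the WHERE keyword.

```python
import sqlite3


def build_table_query(table, where=None, limit=None):
    """Assemble a SELECT statement for a table (hypothetical helper)."""
    # The table name is interpolated directly, so it must come from trusted code.
    sql = "SELECT * FROM %s" % table
    if where:
        sql += " WHERE %s" % where  # assumption: where excludes the keyword
    if limit is not None:
        sql += " LIMIT %d" % int(limit)
    return sql


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER)")
conn.executemany("INSERT INTO items VALUES (?)", [(i,) for i in range(5)])

query = build_table_query("items", where="id > ?", limit=2)
rows = conn.execute(query + "", (1,)).fetchall()
```

Values in the where clause are still supplied through params, so only the table name and clause text need to be trusted.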
-
-
class glide.extract.URLExtract(name, _log=False, _debug=False, **default_context)
Bases: glide.core.Node
Extract data from a URL with requests
-
run(request, data_type='content', session=None, skip_raise=False, handle_paging=None, page_limit=None, push_pages=False, **kwargs)
Extract data from a URL using requests and push the response data (response.content by default; see data_type). The input request may be a string (GET that URL) or a dictionary of args to requests.request:
http://2.python-requests.org/en/master/api/?highlight=get#requests.request
See the requests docs for information on authentication options:
https://requests.kennethreitz.org/en/master/user/authentication/
- Parameters
request (str or dict) – If str, a URL to GET. If a dict, args to requests.request
data_type (str, optional) – One of “content”, “text”, or “json” to control extraction of data from requests response.
session (optional) – A requests Session to use to make the request
skip_raise (bool, optional) – If False, raise exceptions for bad response status
handle_paging (callable, optional) –
A callable that accepts the following params and updates the args that will be passed to requests.request in place. The callable should return two values, the page data extracted from the API response and a flag denoting whether the last page has been reached. Arguments:
result: the API result of the most recent request
request: a request args dict to update
page_limit (int, optional) – If passed, use as a cap of the number of pages pulled
push_pages (bool, optional) – If true, push each page individually.
**kwargs – Keyword arguments to pass to the request method. If a dict is passed for the request parameter, its values override these keyword arguments.
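A handle_paging callable following the contract described above might look like the sketch below. The response shape (the "items" and "next_page" keys), the "page" query parameter, and the example URL are all hypothetical; result is assumed to be already decoded JSON. The loop and fake_request stand in for the node and the network so the example runs offline.

```python
def handle_paging(result, request):
    """Return (page_data, is_last_page) and update request in place."""
    data = result["items"]
    next_page = result.get("next_page")
    if next_page is None:
        return data, True
    # Mutate the request args so the next call fetches the following page.
    request.setdefault("params", {})["page"] = next_page
    return data, False


# Stand-in for a paged API (hypothetical data).
FAKE_PAGES = {
    1: {"items": [1, 2], "next_page": 2},
    2: {"items": [3], "next_page": None},
}


def fake_request(request):
    """Look up the page named in the request args instead of calling a server."""
    return FAKE_PAGES[request.get("params", {}).get("page", 1)]


request = {"method": "GET", "url": "https://api.example.com/items"}
collected = []
last_page = False
while not last_page:
    result = fake_request(request)
    data, last_page = handle_paging(result, request)
    collected.extend(data)
```

Updating the request dict in place keeps the callable free of any knowledge about how the request is sent; it only decides what the next request looks like and when to stop.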
-