okera package¶
Subpackages¶
Submodules¶
okera.concurrency module¶
okera.odas module¶
-
class
okera.odas.
JsonScanTask
(ctx, plan_hosts, task, max_records)[source]¶ Bases:
okera.odas.ScanTask
-
class
okera.odas.
OkeraContext
(application_name, tz=<UTC>)[source]¶ Bases:
object
-
connect
(host='localhost', port=12050, timeout=None)[source]¶ Get a connection to an ODAS cluster. This connects to the planner service.
Parameters: - host (str or list of hostnames) – The hostname for the planner. If a list is specified, picks a planner at random.
- port (int, optional) – The port number for the planner. The default is 12050.
- timeout (int, optional) – Connection timeout in seconds. Default is no timeout.
Returns: Handle to a connection. Users should call close() when done.
Return type:
-
connect_worker
(host='localhost', port=13050, timeout=None)[source]¶ Get a connection to ODAS worker.
Most users should not need to call this API directly.
Parameters: - host (str or list of hostnames) – The hostname for the worker. If a list is specified, picks a worker at random.
- port (int, optional) – The port number for the worker. The default is 13050.
- timeout (int, optional) – Connection timeout in seconds. Default is no timeout.
Returns: Handle to a worker connection. Users must call close() when done.
Return type:
-
disable_auth
()[source]¶ Disables authentication.
Returns: Returns this object. Return type: OkeraContext
-
enable_kerberos
(service_name, host_override=None)[source]¶ Enable kerberos based authentication.
Parameters: - service_name (str) – Authenticate to a particular okera service principal. This is typically the first part of the 3-part service principal (SERVICE_NAME/HOST@REALM).
- host_override (str, optional) – If set, the HOST portion of the server’s service principal. If not set, then this is the resolved DNS name of the service being connected to.
Returns: Returns this object.
Return type:
-
enable_token_auth
(token_str=None, token_file=None)[source]¶ Enables token based authentication.
Parameters: - token_str (str, optional) – Authentication token to use.
- token_file (str, optional) – File containing token to use.
Returns: Returns this object.
Return type:
-
-
class
okera.odas.
OkeraFsStream
(planner, tbl, delimiter=', ', quote_strings=True)[source]¶ Bases:
object
Wrapper object which behaves like a stream to send serialized results back in a byte stream based API. The API is intended to be compatible with a urllib stream object.
-
class
okera.odas.
PandasScanTask
(ctx, plan_hosts, task, max_records, options, strings_as_utf8)[source]¶ Bases:
okera.odas.ScanTask
-
class
okera.odas.
PlannerConnection
(thrift_service, ctx)[source]¶ Bases:
object
A connection to an ODAS planner.
-
cat
(path, as_utf8=True)[source]¶ Returns the object at this path as a string
Parameters: - path (str) – The path to the file to read.
- as_utf8 (bool) – If true, convert the returned data as a utf-8 string (instead of binary)
Returns: Returns the contents at the path as a string.
Return type: str
-
execute_ddl
(sql)[source]¶ Execute a DDL statement against the server.
Parameters: sql (str) – DDL statement to run Returns: Returns the result as a table. Return type: list(list(str)) Examples
>>> import okera >>> ctx = okera.context() >>> with ctx.connect(host = 'localhost', port = 12050) as conn: ... result = conn.execute_ddl('describe okera_sample.users') ... result [['uid', 'string', 'Unique user id'], ['dob', 'string', 'Formatted as DD-month-YY'], ['gender', 'string', ''], ['ccn', 'string', 'Sensitive data, should not be accessible without masking.']]
-
execute_ddl_table_output
(sql)[source]¶ Execute a DDL statement against the server.
Parameters: sql (str) – DDL statement to run Returns: Returns the result as a table object. Return type: PrettyTable Examples
>>> import okera >>> ctx = okera.context() >>> with ctx.connect(host = 'localhost', port = 12050) as conn: ... result = conn.execute_ddl_table_output('describe okera_sample.users') ... print(result) +--------+--------+-----------------------------------------------------------+ | name | type | comment | +--------+--------+-----------------------------------------------------------+ | uid | string | Unique user id | | dob | string | Formatted as DD-month-YY | | gender | string | | | ccn | string | Sensitive data, should not be accessible without masking. | +--------+--------+-----------------------------------------------------------+
-
get_catalog_objects_at
(path_prefix, include_views=False)[source]¶ - Returns the objects (databases or datasets) thats registered with this
- prefix path.
Parameters: - path_prefix (str) – The path prefix to look up objects defined with this prefix.
- include_views (bool) – If true, also return views at this path.
Returns: For each path with a catalog objects, the list of objects located at that path. Empty map if there are none.
Return type: map(str, list(str))
-
list_databases
()[source]¶ Lists all the databases in the catalog
Returns: List of database names. Return type: list(str) Examples
>>> import okera >>> ctx = okera.context() >>> with ctx.connect(host = 'localhost', port = 12050) as conn: ... dbs = conn.list_databases() ... 'okera_sample' in dbs True
-
list_dataset_names
(db, filter=None)[source]¶ Returns the names of the datasets in this db
Parameters: - db (str) – Name of database to return datasets in.
- filter (str, optional) – Substring filter on names to of datasets to return.
Returns: List of dataset names.
Return type: list(str)
Examples
>>> import okera >>> ctx = okera.context() >>> with ctx.connect(host = 'localhost', port = 12050) as conn: ... datasets = conn.list_dataset_names('okera_sample') ... datasets ['okera_sample.sample', 'okera_sample.users', 'okera_sample.users_ccn_masked', 'okera_sample.whoami']
-
list_datasets
(db, filter=None)[source]¶ Returns the datasets in this db
Parameters: - db (str) – Name of database to return datasets in.
- filter (str, optional) – Substring filter on names to of datasets to return.
Returns: Thrift dataset objects.
Return type: obj
Note
This API is subject to change and the returned object may not be backwards compatible.
-
ls
(path)[source]¶ Lists the files in this directory
Parameters: path (str) – The path to list. Returns: List of files located at this path. Return type: list(str)
-
open
(path, preload_content=True, version=None)[source]¶ Returns the object at this path as a byte stream
Parameters: path (str) – The path to the file to open. Returns: Returns an object that behaves like an opened urllib3 stream. Return type: object
-
plan
(request, max_task_count=None, requesting_user=None)[source]¶ Plans the request to read from CDAS :param request: Name of dataset or SQL statement to plan scan for. :type request: str, required :param requesting_user: Name of user to request plan for, if different from
the current user.Returns: Thrift serialized plan object. Return type: object Note
This API is subject to change and the returned object may not be backwards compatible.
-
scan_as_json
(request, max_records=None, warnings=None, max_client_process_count=8, max_task_count=None, requesting_user=None, ignore_errors=False)[source]¶ Scans data, returning the result in json format.
Parameters: - request (string, required) – Name of dataset or SQL statement to scan.
- max_records (int, optional) – Maximum number of records to return. Default is unlimited.
- warnings (list(string), optional) – If not None, will be populated with any warnings generated for request.
Returns: Data returned as a list of JSON objects
Return type: list(obj)
Examples
>>> import okera >>> ctx = okera.context() >>> with ctx.connect(host = 'localhost', port = 12050) as conn: ... data = conn.scan_as_json('okera_sample.sample') ... data [{'record': 'This is a sample test file.'}, {'record': 'It should consist of two lines.'}]
-
scan_as_pandas
(request, max_records=None, max_client_process_count=8, max_task_count=None, requesting_user=None, options=None, ignore_errors=False, warnings=None, strings_as_utf8=False)[source]¶ Scans data, returning the result for pandas.
Parameters: - request (string, required) – Name of dataset or SQL statement to scan.
- max_records (int, optional) – Maximum number of records to return. Default is unlimited.
- options (dictionary, optional) – Optional key/value configs to specify to the request. Note that these options are not guaranteed to be backwards compatible.
- warnings (list(string), optional) – If not None, will be populated with any warnings generated for request.
Returns: Data returned as a pandas DataFrame object
Return type: pandas DataFrame
Examples
>>> import okera >>> ctx = okera.context() >>> with ctx.connect(host = 'localhost', port = 12050) as conn: ... pd = conn.scan_as_pandas('select * from okera_sample.sample') ... print(pd) record 0 b'This is a sample test file.' 1 b'It should consist of two lines.'
-
-
class
okera.odas.
WorkerConnection
(thrift_service, ctx)[source]¶ Bases:
object
A connection to a CDAS worker.
-
exec_task
(task, max_records=None)[source]¶ Executes a task to begin scanning records.
Parameters: - task (obj) – Description of task. This is the result from the planner’s plan() call.
- max_records (int, optional) – Maximum number of records to return for this task. Default is unlimited.
Returns: - object – Handle for this task. Used in subsequent API calls.
- object – Schema for records returned from this task.
-
-
okera.odas.
context
(application_name=None)[source]¶ Gets the top level context object to use pyokera.
Parameters: application_name (str, optional) – Name of this application. Used for logging and diagnostics. Returns: Context object. Return type: OkeraContext Examples
>>> import okera >>> ctx = okera.context() >>> ctx # doctest: +ELLIPSIS <okera.odas.OkeraContext object at 0x...>