okera package

Submodules

okera.botocore_patch module

okera.botocore_patch.patch_botocore()[source]

okera.concurrency module

class okera.concurrency.BaseBackgroundTask(name)[source]

Bases: object

Name
set_handler(handler)[source]
class okera.concurrency.ConcurrencyController(worker_count=8)[source]

Bases: object

enqueueTask(task)[source]
start()[source]
stop()[source]
terminate()[source]
class okera.concurrency.TaskHandlerProcess(task_queue, output_queue, metrics_dict, errors_queue, concurrency_ctl)[source]

Bases: multiprocessing.context.Process

run()[source]

Method to be run in sub-process; can be overridden in sub-class

okera.concurrency.default_max_client_process_count()[source]

okera.odas module

class okera.odas.JsonScanTask(ctx, plan_hosts, task, max_records)[source]

Bases: okera.odas.ScanTask

deserialize(schema, columnar_records, num_records)[source]

Abstract definition to deserialize the returned dataset

class okera.odas.OkeraContext(application_name, tz=<UTC>)[source]

Bases: object

connect(host='localhost', port=12050, timeout=None)[source]

Get a connection to an ODAS cluster. This connects to the planner service.

Parameters:
  • host (str or list of hostnames) – The hostname for the planner. If a list is specified, picks a planner at random.
  • port (int, optional) – The port number for the planner. The default is 12050.
  • timeout (int, optional) – Connection timeout in seconds. Default is no timeout.
Returns:

Handle to a connection. Users should call close() when done.

Return type:

PlannerConnection

connect_worker(host='localhost', port=13050, timeout=None)[source]

Get a connection to ODAS worker.

Most users should not need to call this API directly.

Parameters:
  • host (str or list of hostnames) – The hostname for the worker. If a list is specified, picks a worker at random.
  • port (int, optional) – The port number for the worker. The default is 13050.
  • timeout (int, optional) – Connection timeout in seconds. Default is no timeout.
Returns:

Handle to a worker connection. Users must call close() when done.

Return type:

WorkerConnection

disable_auth()[source]

Disables authentication.

Returns:Returns this object.
Return type:OkeraContext
enable_kerberos(service_name, host_override=None)[source]

Enable kerberos based authentication.

Parameters:
  • service_name (str) – Authenticate to a particular okera service principal. This is typically the first part of the 3-part service principal (SERVICE_NAME/HOST@REALM).
  • host_override (str, optional) – If set, the HOST portion of the server’s service principal. If not set, then this is the resolved DNS name of the service being connected to.
Returns:

Returns this object.

Return type:

OkeraContext

enable_token_auth(token_str=None, token_file=None)[source]

Enables token based authentication.

Parameters:
  • token_str (str, optional) – Authentication token to use.
  • token_file (str, optional) – File containing token to use.
Returns:

Returns this object.

Return type:

OkeraContext

get_auth()[source]

Returns the configured auth mechanism. None if no auth is enabled.

get_name()[source]

Returns name of this application. This is recorded for diagnostics on the server.

get_timezone()[source]
get_token()[source]

Returns the token string. Note that logging this should be done with care.

class okera.odas.OkeraFsStream(planner, tbl, delimiter=', ', quote_strings=True)[source]

Bases: object

Wrapper object which behaves like a stream to send serialized results back in a byte stream based API. The API is intended to be compatible with a urllib stream object.

read(amt)[source]
class okera.odas.PandasScanTask(ctx, plan_hosts, task, max_records, options, strings_as_utf8)[source]

Bases: okera.odas.ScanTask

deserialize(schema, columnar_records, num_records)[source]

Abstract definition to deserialize the returned dataset

class okera.odas.PlannerConnection(thrift_service, ctx)[source]

Bases: object

A connection to an ODAS planner.

cat(path, as_utf8=True)[source]

Returns the object at this path as a string

Parameters:
  • path (str) – The path to the file to read.
  • as_utf8 (bool) – If true, convert the returned data as a utf-8 string (instead of binary)
Returns:

Returns the contents at the path as a string.

Return type:

str

close()[source]

Close the session and server connection.

execute_ddl(sql)[source]

Execute a DDL statement against the server.

Parameters:sql (str) – DDL statement to run
Returns:Returns the result as a table.
Return type:list(list(str))

Examples

>>> import okera
>>> ctx = okera.context()
>>> with ctx.connect(host = 'localhost', port = 12050) as conn:
...     result = conn.execute_ddl('describe okera_sample.users')
...     result
[['uid', 'string', 'Unique user id'], ['dob', 'string', 'Formatted as DD-month-YY'], ['gender', 'string', ''], ['ccn', 'string', 'Sensitive data, should not be accessible without masking.']]
execute_ddl_table_output(sql)[source]

Execute a DDL statement against the server.

Parameters:sql (str) – DDL statement to run
Returns:Returns the result as a table object.
Return type:PrettyTable

Examples

>>> import okera
>>> ctx = okera.context()
>>> with ctx.connect(host = 'localhost', port = 12050) as conn:
...     result = conn.execute_ddl_table_output('describe okera_sample.users')
...     print(result)
+--------+--------+-----------------------------------------------------------+
|  name  |  type  |                          comment                          |
+--------+--------+-----------------------------------------------------------+
|  uid   | string |                       Unique user id                      |
|  dob   | string |                  Formatted as DD-month-YY                 |
| gender | string |                                                           |
|  ccn   | string | Sensitive data, should not be accessible without masking. |
+--------+--------+-----------------------------------------------------------+
get_catalog_objects_at(path_prefix, include_views=False)[source]
Returns the objects (databases or datasets) thats registered with this
prefix path.
Parameters:
  • path_prefix (str) – The path prefix to look up objects defined with this prefix.
  • include_views (bool) – If true, also return views at this path.
Returns:

For each path with a catalog objects, the list of objects located at that path. Empty map if there are none.

Return type:

map(str, list(str))

get_protocol_version()[source]

Returns the RPC API version of the server.

list_databases()[source]

Lists all the databases in the catalog

Returns:List of database names.
Return type:list(str)

Examples

>>> import okera
>>> ctx = okera.context()
>>> with ctx.connect(host = 'localhost', port = 12050) as conn:
...     dbs = conn.list_databases()
...     'okera_sample' in dbs
True
list_dataset_names(db, filter=None)[source]

Returns the names of the datasets in this db

Parameters:
  • db (str) – Name of database to return datasets in.
  • filter (str, optional) – Substring filter on names to of datasets to return.
Returns:

List of dataset names.

Return type:

list(str)

Examples

>>> import okera
>>> ctx = okera.context()
>>> with ctx.connect(host = 'localhost', port = 12050) as conn:
...     datasets = conn.list_dataset_names('okera_sample')
...     datasets
['okera_sample.sample', 'okera_sample.users', 'okera_sample.users_ccn_masked', 'okera_sample.whoami']
list_datasets(db, filter=None)[source]

Returns the datasets in this db

Parameters:
  • db (str) – Name of database to return datasets in.
  • filter (str, optional) – Substring filter on names to of datasets to return.
Returns:

Thrift dataset objects.

Return type:

obj

Note

This API is subject to change and the returned object may not be backwards compatible.

ls(path)[source]

Lists the files in this directory

Parameters:path (str) – The path to list.
Returns:List of files located at this path.
Return type:list(str)
open(path, preload_content=True, version=None)[source]

Returns the object at this path as a byte stream

Parameters:path (str) – The path to the file to open.
Returns:Returns an object that behaves like an opened urllib3 stream.
Return type:object
plan(request, max_task_count=None, requesting_user=None)[source]

Plans the request to read from CDAS :param request: Name of dataset or SQL statement to plan scan for. :type request: str, required :param requesting_user: Name of user to request plan for, if different from

the current user.
Returns:Thrift serialized plan object.
Return type:object

Note

This API is subject to change and the returned object may not be backwards compatible.

scan_as_json(request, max_records=None, warnings=None, max_client_process_count=8, max_task_count=None, requesting_user=None, ignore_errors=False)[source]

Scans data, returning the result in json format.

Parameters:
  • request (string, required) – Name of dataset or SQL statement to scan.
  • max_records (int, optional) – Maximum number of records to return. Default is unlimited.
  • warnings (list(string), optional) – If not None, will be populated with any warnings generated for request.
Returns:

Data returned as a list of JSON objects

Return type:

list(obj)

Examples

>>> import okera
>>> ctx = okera.context()
>>> with ctx.connect(host = 'localhost', port = 12050) as conn:
...     data = conn.scan_as_json('okera_sample.sample')
...     data
[{'record': 'This is a sample test file.'}, {'record': 'It should consist of two lines.'}]
scan_as_pandas(request, max_records=None, max_client_process_count=8, max_task_count=None, requesting_user=None, options=None, ignore_errors=False, warnings=None, strings_as_utf8=False)[source]

Scans data, returning the result for pandas.

Parameters:
  • request (string, required) – Name of dataset or SQL statement to scan.
  • max_records (int, optional) – Maximum number of records to return. Default is unlimited.
  • options (dictionary, optional) – Optional key/value configs to specify to the request. Note that these options are not guaranteed to be backwards compatible.
  • warnings (list(string), optional) – If not None, will be populated with any warnings generated for request.
Returns:

Data returned as a pandas DataFrame object

Return type:

pandas DataFrame

Examples

>>> import okera
>>> ctx = okera.context()
>>> with ctx.connect(host = 'localhost', port = 12050) as conn:
...     pd = conn.scan_as_pandas('select * from okera_sample.sample')
...     print(pd)
                               record
0      b'This is a sample test file.'
1  b'It should consist of two lines.'
set_application(name)[source]

Sets the name of this session. Used for logging purposes on the server.

class okera.odas.ScanTask(name, ctx, plan_hosts, task, max_records, options)[source]

Bases: okera.concurrency.BaseBackgroundTask

deserialize(schema, columnar_records, num_records)[source]

Abstract definition to deserialize the returned dataset

class okera.odas.WorkerConnection(thrift_service, ctx)[source]

Bases: object

A connection to a CDAS worker.

close()[source]

Close the session and server connection.

close_task(handle)[source]

Closes the task.

exec_task(task, max_records=None)[source]

Executes a task to begin scanning records.

Parameters:
  • task (obj) – Description of task. This is the result from the planner’s plan() call.
  • max_records (int, optional) – Maximum number of records to return for this task. Default is unlimited.
Returns:

  • object – Handle for this task. Used in subsequent API calls.
  • object – Schema for records returned from this task.

fetch(handle)[source]

Fetch the next batch of records for this task.

get_protocol_version()[source]

Returns the RPC API version of the server.

set_application(name)[source]

Sets the name of this session. Used for logging purposes on the server.

okera.odas.context(application_name=None)[source]

Gets the top level context object to use pyokera.

Parameters:application_name (str, optional) – Name of this application. Used for logging and diagnostics.
Returns:Context object.
Return type:OkeraContext

Examples

>>> import okera
>>> ctx = okera.context()
>>> ctx                                         # doctest: +ELLIPSIS
<okera.odas.OkeraContext object at 0x...>
okera.odas.version()[source]

Returns version string of this library.

Module contents

okera.check_and_patch_botocore()[source]
okera.get_default_context()[source]
okera.initialize_default_context()[source]
okera.should_patch_botocore()[source]