Data API¶
After creating a Table and some column families, you are ready to store and retrieve data.
Cells vs. Columns vs. Column Families¶
- As we saw before, a table can have many column families.
- As we’ll see below, a table also has many rows (specified by row keys).
- Within a row, data is stored in a cell. A cell simply has a value (as bytes) and a timestamp. The number of cells in each row can be different, depending on what was stored in each row.
- Each cell lies in a column (not a column family). A column is really just a more specific modifier within a column family. A column can be present in every row, in only one row, or anywhere in between.
- Within a column family there can be many columns. For example, within the column family foo we could have columns bar and baz. These would typically be represented as foo:bar and foo:baz.
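The layout described above can be pictured as nested mappings. The following is a minimal in-memory sketch, not the library's storage format; the `Cell` tuple, the `table` dict and the `columns_in_row` helper are hypothetical names used only for illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for a stored cell: a bytes value plus a timestamp.
Cell = namedtuple("Cell", ["value", "timestamp"])

# row key -> column family -> column -> list of cells
table = {
    b"row-1": {
        "foo": {
            b"bar": [Cell(b"v2", 2000), Cell(b"v1", 1000)],
            b"baz": [Cell(b"x", 1500)],
        },
    },
    b"row-2": {
        "foo": {
            # This row has no cells in foo:baz at all.
            b"bar": [Cell(b"only", 1200)],
        },
    },
}


def columns_in_row(table, row_key):
    """Return sorted 'family:column' names that hold at least one cell."""
    names = []
    for family, columns in table[row_key].items():
        for column, cells in columns.items():
            if cells:
                names.append("%s:%s" % (family, column.decode("utf-8")))
    return sorted(names)
```

Here `row-1` has cells in both foo:bar and foo:baz, while `row-2` only has a cell in foo:bar, which is the "different number of cells per row" situation described above.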
Modifying Data¶
Since data is stored in cells, which are stored in rows, the Row class is the only class used to modify (write, update, delete) data in a Table.
Row Factory¶
To create a Row object:
row = table.row(row_key)
Unlike the string values we’ve used before, the row key must be bytes.
Direct vs. Conditional vs. Append¶
There are three ways to modify data in a table, described by the MutateRow, CheckAndMutateRow and ReadModifyWriteRow API methods.
- The direct way is via MutateRow which involves simply adding, overwriting or deleting cells.
- The conditional way is via CheckAndMutateRow. This method first checks whether some filter is matched in a given row, then applies one of two sets of mutations, depending on whether a match occurred.
- The append way is via ReadModifyWriteRow. This simply appends (as bytes) or increments (as an integer) data in a presumed existing cell in a row.
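The conditional mode is the least obvious of the three, so here is a minimal sketch of its semantics on a plain dict. It is not the library's implementation: `check_and_mutate`, `set_value` and the use of a Python callable as the "filter" are all assumptions made for illustration.

```python
def check_and_mutate(row, predicate, true_mutations, false_mutations):
    """Apply one of two mutation lists depending on whether predicate matches.

    `row` is a plain dict of column name -> bytes value and each mutation is
    a callable that edits the dict in place. The predicate is evaluated
    before any mutation is applied. Returns whether the predicate matched.
    """
    matched = predicate(row)
    for mutation in (true_mutations if matched else false_mutations):
        mutation(row)
    return matched


def set_value(column, value):
    """Build a mutation that writes `value` into `column`."""
    def apply(row):
        row[column] = value
    return apply


row = {"foo:bar": b"old"}
matched = check_and_mutate(
    row,
    predicate=lambda r: "foo:bar" in r,
    true_mutations=[set_value("foo:bar", b"new")],
    false_mutations=[set_value("foo:bar", b"created")],
)
```

Only one of the two lists is ever applied for a given call; the other is discarded.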
Building Up Mutations¶
In all three cases, a set of mutations (or two sets) is built up on a Row before being sent off in a batch via commit():
row.commit()
To send append mutations in a batch, use commit_modifications():
row.commit_modifications()
We have a small set of methods on the Row to build these mutations up.
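The accumulate-then-send pattern can be sketched with a small stand-in class. `PendingRow` and its `sent_batches` attribute are hypothetical and exist only to show the flow; the real Row sends an API request on commit() rather than recording a list.

```python
class PendingRow(object):
    """Hypothetical stand-in for a row that batches mutations client-side."""

    def __init__(self, row_key):
        self.row_key = row_key
        self._mutations = []     # queued until commit()
        self.sent_batches = []   # records what each commit() "sent"

    def set_cell(self, family, column, value):
        # Nothing is sent here; the mutation is only queued.
        self._mutations.append(("set", family, column, value))

    def delete(self):
        self._mutations.append(("delete_row",))

    def clear_mutations(self):
        # Drop everything queued so far without sending it.
        del self._mutations[:]

    def commit(self):
        if not self._mutations:
            return  # nothing queued, nothing to send in this sketch
        self.sent_batches.append(list(self._mutations))
        self.clear_mutations()


row = PendingRow(b"row-1")
row.set_cell("foo", b"bar", b"v1")
row.set_cell("foo", b"baz", b"v2")
row.commit()  # both queued mutations travel together as one batch
```

The point of the design is that several cell changes for one row are delivered together instead of one request per change.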
Direct Mutations¶
Direct mutations can be added via one of four methods:
- set_cell() allows a single value to be written to a column:
  row.set_cell(column_family_id, column, value, timestamp=timestamp)
  If the timestamp is omitted, the current time on the Google Cloud Bigtable server will be used when the cell is stored.
  The value can either be bytes or an integer (which will be converted to bytes as an unsigned 64-bit integer).
- delete_cell() deletes all cells (i.e. for all timestamps) in a given column:
  row.delete_cell(column_family_id, column)
  Remember, this only happens in the row we are using.
  If we only want to delete cells from a limited range of time, a TimestampRange can be used:
  row.delete_cell(column_family_id, column, time_range=time_range)
- delete_cells() does the same thing as delete_cell() but accepts a list of columns in a column family rather than a single one:
  row.delete_cells(column_family_id, [column1, column2], time_range=time_range)
  In addition, if we want to delete cells from every column in a column family, the special ALL_COLUMNS value can be used:
  row.delete_cells(column_family_id, Row.ALL_COLUMNS, time_range=time_range)
- delete() will delete the entire row:
  row.delete()
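To make the effect of a time range concrete, here is a minimal in-memory sketch of deleting cells from one column family. It is not the library's code: the function below is a free-standing model, the `(timestamp, value)` tuples are a hypothetical representation, and treating the range as start-inclusive and end-exclusive is an assumption of the sketch.

```python
def delete_cells(columns, names, time_range=None):
    """Remove cells from the named columns of one column family.

    `columns` maps column name -> list of (timestamp, value) tuples.
    `time_range` is an optional (start, end) pair; start is inclusive and
    end is exclusive (an assumption of this sketch). With no range, every
    cell in the named columns is removed.
    """
    for name in names:
        cells = columns.get(name, [])
        if time_range is None:
            kept = []
        else:
            start, end = time_range
            kept = [(ts, v) for ts, v in cells if not (start <= ts < end)]
        if kept:
            columns[name] = kept
        else:
            # A column with no cells left simply disappears from the row.
            columns.pop(name, None)


family = {
    b"bar": [(300, b"c"), (200, b"b"), (100, b"a")],
    b"baz": [(150, b"z")],
}
# Only the cells of "bar" with timestamps in [100, 250) are removed.
delete_cells(family, [b"bar"], time_range=(100, 250))
```

Passing every column name with no range plays the role that ALL_COLUMNS plays in the text above.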
Conditional Mutations¶
Making conditional modifications is essentially identical to making direct modifications, but we need to specify a filter to match against in the row:
row = table.row(row_key, filter=filter)
See the Row class for more information about acceptable values for filter.
The only other difference from direct modifications is that each mutation added must specify a state: will the mutation be applied if the filter matches or if it fails to match?
For example:
row.set_cell(column_family_id, column, value,
             timestamp=timestamp, state=True)
Note
If state is passed when no filter is set on a Row, adding the mutation will fail. Similarly, if no state is passed when a filter has been set, adding the mutation will fail.
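The rule in the note can be modelled in a few lines. `SketchRow` is a hypothetical class, and raising `ValueError` is an assumption of this sketch rather than a statement about which exception the library uses.

```python
class SketchRow(object):
    """Hypothetical model of how mutations are sorted by `state`."""

    def __init__(self, filter_=None):
        self.filter_ = filter_
        self.direct = []        # used when no filter is set
        self.if_matched = []    # mutations added with state=True
        self.if_unmatched = []  # mutations added with state=False

    def add(self, mutation, state=None):
        if self.filter_ is None:
            if state is not None:
                raise ValueError("state passed but no filter is set")
            self.direct.append(mutation)
        else:
            if state is None:
                raise ValueError("filter is set, so state is required")
            target = self.if_matched if state else self.if_unmatched
            target.append(mutation)


conditional = SketchRow(filter_="some-filter")
conditional.add("write-when-matched", state=True)
conditional.add("write-when-unmatched", state=False)
```

Keeping two separate lists is what lets a single conditional commit carry both the "matched" and the "not matched" sets of mutations.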
Append Mutations¶
Append mutations can be added via one of two methods:
- append_cell_value appends a bytes value to an existing cell:
  row.append_cell_value(column_family_id, column, bytes_value)
- increment_cell_value increments an integer value in an existing cell:
  row.increment_cell_value(column_family_id, column, int_value)
Since only bytes are stored in a cell, the current value is decoded as an unsigned 64-bit integer before being incremented. (This happens on the Google Cloud Bigtable server, not in the library.)
Notice that no timestamp was specified. This is because append mutations operate on the latest value of the specified column.
If there are no cells in the specified column, then the empty string (bytes case) or zero (integer case) is the assumed value.
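The default-value behaviour can be sketched as two pure functions over the latest stored value. This only models what the server is described as doing; the function names are hypothetical, and the 8-byte big-endian encoding is an assumption of the sketch (for the small non-negative values used here, signed and unsigned encodings produce identical bytes).

```python
import struct

# 8-byte big-endian integer encoding; an assumption of this sketch.
_INT64 = struct.Struct(">q")


def append_value(latest, suffix):
    """Append bytes to the latest value; a missing cell counts as b''."""
    return (latest if latest is not None else b"") + suffix


def increment_value(latest, amount):
    """Add `amount` to the latest value; a missing cell counts as zero."""
    current = _INT64.unpack(latest)[0] if latest is not None else 0
    return _INT64.pack(current + amount)


first = increment_value(None, 5)    # no cell yet, so counting starts at zero
second = increment_value(first, 3)  # decodes 5, stores 8
```

Passing `None` stands for "no cells in the column", which is why neither function needs a separate create step.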
Starting Fresh¶
If accumulated mutations need to be dropped, use clear_mutations():
row.clear_mutations()
To clear append mutations, use clear_modification_rules():
row.clear_modification_rules()
Reading Data¶
Read Single Row from a Table¶
To make a ReadRows API request for a single row key, use Table.read_row():
row_data = table.read_row(row_key)
Rather than returning a Row, this method returns a PartialRowData instance. This class is used for reading and parsing data rather than for modifying data (as Row is).
A filter can also be applied to the results:
row_data = table.read_row(row_key, filter=filter)
The allowable filter values are the same as those used for a Row with conditional mutations. For more information, see the Table.read_row() documentation.
Stream Many Rows from a Table¶
To make a ReadRows API request for a stream of rows, use Table.read_rows():
row_data = table.read_rows()
Using gRPC over HTTP/2, a continual stream of responses will be delivered.
We have a custom PartialRowsData class to allow consuming and parsing these streams as they come.
In particular:
- consume_next() pulls the next result from the stream, parses it and stores it on the PartialRowsData instance
- consume_all() pulls results from the stream until there are no more
- cancel() closes the stream
See the PartialRowsData documentation for more information.
As with Table.read_row(), an optional filter can be applied. In addition, a start_key and/or end_key can be supplied for the stream, a limit can be set, and a boolean allow_row_interleaving can be specified to allow faster streamed results at the potential cost of non-sequential reads.
See the Table.read_rows() documentation for more information on the optional arguments.
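How a key range and a limit narrow a stream can be sketched with a generator over an in-memory dict. `stream_rows` is a hypothetical function, not part of the library, and treating `start_key` as inclusive and `end_key` as exclusive is an assumption of the sketch.

```python
def stream_rows(rows, start_key=None, end_key=None, limit=None):
    """Yield (row_key, data) pairs in key order, one at a time.

    `rows` is a dict of row key (bytes) -> data. The range includes
    `start_key` and excludes `end_key` (an assumption of this sketch).
    """
    sent = 0
    for key in sorted(rows):
        if start_key is not None and key < start_key:
            continue
        if end_key is not None and key >= end_key:
            break
        if limit is not None and sent >= limit:
            break
        yield key, rows[key]
        sent += 1


rows = {b"a": 1, b"b": 2, b"c": 3, b"d": 4}
stream = stream_rows(rows, start_key=b"b", end_key=b"d")
first = next(stream)  # pull a single result from the stream
rest = list(stream)   # drain whatever is left
```

In this model, `next(stream)` loosely parallels pulling one result at a time, `list(stream)` parallels consuming everything, and a generator's `close()` method parallels cancelling the stream early.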
Sample Keys in a Table¶
Make a SampleRowKeys API request with Table.sample_row_keys():
keys_iterator = table.sample_row_keys()
The returned row keys will delimit contiguous sections of the table of approximately equal size, which can be used to break up the data for distributed tasks like mapreduces.
As with Table.read_rows(), the returned keys_iterator is connected to a cancellable HTTP/2 stream.
The next key in the result can be accessed via
next_key = next(keys_iterator)
or all keys can be iterated over via
for curr_key in keys_iterator:
    do_something(curr_key)
Just as with reading, the stream can be canceled:
keys_iterator.cancel()
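One way sampled keys could be used to split work is to turn them into contiguous key ranges, one per worker. This is a minimal sketch: `key_ranges` is a hypothetical helper, and it assumes the row keys have already been pulled out of the streamed results as a sorted list of bytes values.

```python
def key_ranges(sample_keys):
    """Turn sorted sample row keys into (start, end) ranges covering the table.

    `None` stands for an open end: the first range has no lower bound and
    the last range has no upper bound. Empty keys are skipped so they do
    not produce an empty range.
    """
    boundaries = [None] + [key for key in sample_keys if key] + [None]
    return list(zip(boundaries[:-1], boundaries[1:]))


ranges = key_ranges([b"g", b"p"])
```

Each `(start, end)` pair could then be handed to a separate task, for instance as the start_key and end_key of its own read, so that the tasks together cover the table without overlapping.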