Add bulk loading Python Sample #2295
Merged: gguuss merged 40 commits into GoogleCloudPlatform:master from tonioshikanlu:tonioshikanlu-patch-1 on Aug 6, 2019.
Commits (40)
All commits authored by tonioshikanlu:

- 3d8615a Create bulk_load_csv
- 3d5eb45 Delete bulk_load_csv
- f3142a5 Create schema.ddl
- 39df592 bulk load csv files and test
- 686da1f Update README.rst
- 00691e8 removed unused library
- 68cfb29 Update spacing to 4 whitespaces
- 5f79383 Update spacing to 4 whitespaces
- 38be833 Rephrase pre-requisite
- cd31d18 Fixed formatting issues
- 1e76ab1 Fixed formatting issues
- 03adc1c Fixed formatting issues
- 87069f6 Fixed Formatting issues
- 3c493c4 Fixed Formatting issues
- cf17c22 Fixed Formatting issues
- f2d721d Fixed whitespace in new line issues
- 2e1ce8d Fixed formatting issues
- 9da1d08 Fixed formatting issues
- f6b87eb Update batch_import_test.py
- 49421e7 Update batch_import.py
- e261d8e Update batch_import_test.py
- e941c09 Update batch_import_test.py
- db08b85 Add files via upload
- 13075d4 Update batch_import_test.py
- b804766 Update batch_import_test.py
- 80a86b9 Update batch_import_test.py
- 40bceb8 Update batch_import.py
- 0481e7d Update batch_import_test.py
- 9d07460 Update batch_import.py
- b97ae16 Update batch_import.py
- 7fa0701 Update README.rst
- 1e05d29 Create requirements.txt
- e96f284 Update README.rst
- 4968b51 Update README.rst
- c81a5a1 Added copyright header
- f1fc89c Added Copyright header
- d304f02 Added tests for two functions
- e7271f9 Update batch_import_test.py
- 506a18e Update batch_import_test.py
- 6af15f7 Update batch_import_test.py
File: README.rst (new, 77 lines)

Google Cloud Spanner: Bulk Loading From CSV Python Sample
=========================================================

``Google Cloud Spanner`` is a highly scalable, transactional, managed, NewSQL database service. Cloud Spanner solves the need for a horizontally scaling database with consistent global transactions and SQL semantics.

This application demonstrates how to load data from a CSV file into a Cloud Spanner database.

The data contained in the CSV files is sourced from the "Hacker News - Y Combinator" BigQuery `public dataset`_.

.. _public dataset: https://cloud.google.com/bigquery/public-data/

Prerequisite
------------

Create a database in your Cloud Spanner instance using the `schema`_ in the folder.

.. _schema: schema.ddl

Setup
-----

Authentication
++++++++++++++

This sample requires you to have authentication set up. Refer to the `Authentication Getting Started Guide`_ for instructions on setting up credentials for applications.

.. _Authentication Getting Started Guide: https://cloud.google.com/docs/authentication/getting-started

Install Dependencies
++++++++++++++++++++

#. Install `pip`_ and `virtualenv`_ if you do not already have them. You may want to refer to the `Python Development Environment Setup Guide`_ for Google Cloud Platform for instructions.

   .. _Python Development Environment Setup Guide: https://cloud.google.com/python/setup

#. Create a virtualenv. Samples are compatible with Python 2.7 and 3.4+.

   MacOS/Linux:

   .. code-block:: bash

       $ virtualenv env
       $ source env/bin/activate

   Windows:

   .. code-block:: bash

       > virtualenv env
       > .\env\Scripts\activate

#. Install the dependencies needed to run the samples.

   .. code-block:: bash

       $ pip install -r requirements.txt

.. _pip: https://pip.pypa.io/
.. _virtualenv: https://virtualenv.pypa.io/

To run the sample
-----------------

.. code-block:: bash

    $ python batch_import.py instance_id database_id

Positional arguments:

* ``instance_id``: Your Cloud Spanner instance ID.
* ``database_id``: Your Cloud Spanner database ID.
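The bulk-load approach this README describes boils down to splitting the parsed CSV rows into fixed-size chunks and committing one Spanner batch mutation per chunk. A minimal, self-contained sketch of just the chunking step (pure Python, no Spanner client; the chunk size of 3 and the sample rows are illustrative, the real script uses 500):

```python
def divide_chunks(rows, n):
    # Yield successive n-row chunks of the parsed CSV data.
    # The sample commits one batch mutation per chunk.
    for i in range(0, len(rows), n):
        yield rows[i:i + n]


# 7 hypothetical rows split into chunks of 3 -> sizes 3, 3, 1.
rows = [('id-%d' % i, 'title-%d' % i) for i in range(7)]
chunks = list(divide_chunks(rows, 3))
print([len(c) for c in chunks])  # [3, 3, 1]
```

Keeping each commit to a bounded number of rows avoids oversized mutations; the last chunk simply carries the remainder.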
File: batch_import.py (new, 136 lines)

# Copyright 2019 Google Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# This application demonstrates how to do batch operations from a csv file
# using Cloud Spanner.
# For more information, see the README.rst.


import argparse
import csv
import threading
import time

from google.cloud import spanner


def is_bool_null(file):
    # Converts the boolean values in the dataset from strings to
    # boolean data types. Also converts the blank string to None,
    # indicating an empty cell.
    data = list(csv.reader(file))
    # Reads each cell of each line in the csv file.
    for line in range(len(data)):
        for cell in range(len(data[line])):
            # Changes the string 'true' to the boolean True.
            if data[line][cell] == 'true':
                data[line][cell] = True
            # Changes a blank string to the Python None type.
            if data[line][cell] == '':
                data[line][cell] = None
    return data


def divide_chunks(lst, n):
    # Divides the csv file into chunks so that the mutation will
    # commit every n rows (500 in this sample).
    for i in range(0, len(lst), n):
        yield lst[i:i + n]


def insert_data(database, filepath, table_name, column_names):
    # Reads the given file belonging to the dataset and writes its
    # rows into Cloud Spanner using the batch mutation function.
    with open(filepath, newline='') as file:
        data = is_bool_null(file)
        data = tuple(data)
        l_group = list(divide_chunks(data, 500))
        # Inserts each chunk of data into the database.
        for current_inserts in l_group:
            if current_inserts is not None:
                with database.batch() as batch:
                    batch.insert(
                        table=table_name,
                        columns=column_names,
                        values=current_inserts)


def main(instance_id, database_id):
    # Inserts sample data into the given database.
    # The database and table must already exist and can be created
    # using `create_database`.
    start = time.time()
    # File paths.
    comments_file = 'hnewscomments.txt'
    stories_file = 'hnewsstories.txt'
    # Instantiates a Spanner client.
    spanner_client = spanner.Client()
    instance = spanner_client.instance(instance_id)
    database = instance.database(database_id)
    # Sets the column names.
    s_columnnames = (
        'id', 'by', 'author', 'dead', 'deleted', 'descendants',
        'score', 'text', 'time', 'time_ts', 'title', 'url',
    )
    c_columnnames = (
        'id', 'by', 'author', 'dead', 'deleted', 'parent',
        'ranking', 'text', 'time', 'time_ts',
    )
    # Creates one thread per file so both tables load in parallel.
    t1 = threading.Thread(
        target=insert_data,
        args=(database, stories_file, 'stories', s_columnnames))
    t2 = threading.Thread(
        target=insert_data,
        args=(database, comments_file, 'comments', c_columnnames))
    # Starts the threads.
    t1.start()
    t2.start()
    # Waits until both threads finish.
    t1.join()
    t2.join()

    print('Finished Inserting Data.')
    end = time.time()
    print('Time: ', end - start)


if __name__ == '__main__':
    parser = argparse.ArgumentParser(
        formatter_class=argparse.RawDescriptionHelpFormatter)
    parser.add_argument('instance_id', help='Your Cloud Spanner instance ID.')
    parser.add_argument('database_id', help='Your Cloud Spanner database ID.')

    args = parser.parse_args()

    main(args.instance_id, args.database_id)
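The CSV normalization step can be exercised standalone, since `csv.reader` accepts any iterable of strings (each string is parsed as one row). A self-contained sketch of that conversion logic, with no Spanner dependency and hypothetical input rows:

```python
import csv


def is_bool_null(file):
    # Parse CSV rows, mapping the string 'true' to the boolean True
    # and the empty string to None (an empty cell).
    data = list(csv.reader(file))
    for line in range(len(data)):
        for cell in range(len(data[line])):
            if data[line][cell] == 'true':
                data[line][cell] = True
            if data[line][cell] == '':
                data[line][cell] = None
    return data


# Two hypothetical CSV lines: a row with a boolean flag, and a row
# with a trailing empty cell.
print(is_bool_null(['12,true', 'hello,']))  # [['12', True], ['hello', None]]
```

This is why the unit tests can pass a plain list of strings to the function instead of an open file object.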
File: batch_import_test.py (new, 67 lines)

# Copyright 2019 Google Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# This application demonstrates how to do batch operations from a csv file
# using Cloud Spanner.
# For more information, see the README.rst.
"""Test for batch_import"""

import os

import pytest

import batch_import
from google.cloud import spanner


INSTANCE_ID = os.environ['SPANNER_INSTANCE']
DATABASE_ID = 'hnewsdb'


@pytest.fixture(scope='module')
def spanner_instance():
    spanner_client = spanner.Client()
    return spanner_client.instance(INSTANCE_ID)


@pytest.fixture
def example_database():
    spanner_client = spanner.Client()
    instance = spanner_client.instance(INSTANCE_ID)
    database = instance.database(DATABASE_ID)

    if not database.exists():
        with open('schema.ddl', 'r') as myfile:
            schema = myfile.read()
        database = instance.database(DATABASE_ID, ddl_statements=[schema])
        database.create()

    yield database
    database.drop()


def test_is_bool_null():
    assert batch_import.is_bool_null(['12', 'true', '', '12',
                                      'jkl', '']) == [['12'], [True],
                                                      [], ['12'],
                                                      ['jkl'], []]


def test_divide_chunks():
    res = list(batch_import.divide_chunks(['12', 'true', '', '12',
                                           'jkl', ''], 2))
    assert res == [['12', 'true'], ['', '12'], ['jkl', '']]


def test_insert_data(capsys):
    batch_import.main(INSTANCE_ID, DATABASE_ID)
    out, _ = capsys.readouterr()
    assert 'Finished Inserting Data.' in out