How to build your data science or machine learning model when you store your e-commerce data in a relational database.
Scope:
The scope of this article is to share how to dive deeper into your e-commerce client data and start doing data science relatively easily and cheaply, when the data lives in a relational database and you want to work in Python with a Jupyter Notebook, and when execution time matters.
We all store data in files or in databases, and as data scientists we need fast access to it so that we can train our models.
Our task
Now we have a task to build a model combining different sources: emails, click data which we capture, mobile apps and sales (see the diagram above). All of these sources are stored in Postgres tables with various foreign keys between them. In order to process the data we first need to access it from the database, structure it and then start playing with it. If you make SQL requests for more than 50 million records, the performance is not great.
Our Approach:
In our case we have Postgres set up on AWS using the RDS service, and we want to use a Jupyter Notebook to build a model with minimal costs. This is for research purposes; we are not talking about putting it into production yet.
We split the task into several steps:
- Gather the data and join all tables by creating an SQL request
- Download and process the data
- Load the data and work with it as a DataFrame in Python in a Jupyter Notebook
The steps:
We will share some example code and discoveries we made.
1. Gather the data and join all tables by creating an SQL request
We noticed that there are two approaches in Python: one is to use plain SQL requests, the other is to use the SQLAlchemy library. Both methods are good and have their advantages. For simplicity we are using SQLAlchemy in production, but we found that building a complex request with it is a pain, so here we decided to write the SQL request ourselves and import the psycopg2 library. There are many ways to grab the data from the database; something to consider when you join tables is what kind of join you want to do (left, right or inner, see the w3schools page on SQL joins for what I mean). Here is example code:
sql = """SELECT track.ts, track.site_id, user.name, user.email, cookies.uid
FROM public.track
LEFT JOIN public.user
ON track.user_id = user.id
LEFT JOIN public.cookies
ON track.cookies_id = cookies.id
ORDER BY track.ts"""
If you need more fields or you are linking more tables, you can follow the example and extend the code. Once it is ready, you can use different database clients to check your query; we are using pgAdmin4.
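If you prefer the SQLAlchemy route mentioned above, the same query string can be run through an engine. This is only a minimal sketch, and the connection string values are placeholders for your own RDS details:

import pandas as pd
from sqlalchemy import create_engine

# Placeholder credentials -- replace with your own RDS details
engine = create_engine("postgresql+psycopg2://your_user:your_password@your_host_link:5432/your_db_name")

# Preview a small slice of the joined data before pulling everything
preview = pd.read_sql(sql + " LIMIT 100", con=engine)
print(preview.head())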
2. Download and process the data
There are various ways to download the data, and we want to share how we approached it in our case. We have 15-20 GB of data, which is a relatively small amount, so we decided to make one request to Postgres and store all the data in files, which are then easy to manipulate and work with. These are the approaches we looked at:
a). Execute the SQL code on the database side and store the result directly to S3, without a VM in between.
We decided to store the files in S3 on AWS and then read from them.
Exporting directly from the database like this is only possible if you are on AWS Aurora (the example below uses the Aurora MySQL syntax). Read the article and the code should look like this:
SELECT * FROM employees INTO OUTFILE S3 's3-us-west-2://aurora-select-into-s3-pdx/sample_employee_data'
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';
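Since our database is Postgres, note that INTO OUTFILE S3 is an Aurora MySQL feature. On Aurora PostgreSQL (and recent RDS PostgreSQL versions) the equivalent is the aws_s3 extension. A minimal sketch, assuming the extension is installed and using placeholder bucket, path and region values:

-- Hypothetical bucket, path and region values for illustration
SELECT * FROM aws_s3.query_export_to_s3(
    'SELECT * FROM employees',
    aws_commons.create_s3_uri('your_bucket_name', 'sample_employee_data', 'us-west-2'),
    options := 'format csv'
);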
b). Run Python code on your machine and upload the data to S3.
It is not very time-efficient to first download all the data locally and then upload it; a minimal sketch of this option follows.
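Assuming you have already dumped the query result to a local CSV, the upload part could look roughly like this (file, bucket and credential names are placeholders):

import boto3

# Placeholder credentials and names -- replace with your own
s3_client = boto3.client('s3',
                         aws_access_key_id='your_S3_id',
                         aws_secret_access_key='your_s3_key')

# Upload the locally saved extract to S3
s3_client.upload_file('local_extract.csv', 'your_bucket_name', 'local_extract.csv')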
c). Use a virtual machine to download the data and upload it to S3.
There are different approaches here; probably the fastest one is to run an EC2 instance or a SageMaker Jupyter Notebook. We decided to choose a cheaper option and use Google Colab. Something else we considered is downloading and uploading the data in chunks, so that we do not need a big-RAM instance to transfer it.
The code looks like this:
import boto3
import pandas as pd
import psycopg2
from io import StringIO

# Connect to the Postgres instance on RDS
connection = psycopg2.connect(user="your_user",
                              password="your_password",
                              host="your_host_link",
                              port="5432",
                              database="your_db_name")

colnames = ['col1', 'col2']  # add the rest of the columns your query returns
counter = 0

# S3 credentials
ACCESS_ID = 'your_S3_id'
ACCESS_KEY = 'your_s3_key'
s3_resource = boto3.resource('s3',
                             aws_access_key_id=ACCESS_ID,
                             aws_secret_access_key=ACCESS_KEY)

# Stream the query result in chunks of one million rows
# and write each chunk to S3 as a separate CSV file
for df in pd.read_sql(sql, con=connection, chunksize=10**6, columns=colnames, index_col=None):
    csv_buffer = StringIO()
    df.to_csv(csv_buffer, index=False)
    s3_resource.Object("your_bucket_name", f'file_name_{counter}.csv').put(Body=csv_buffer.getvalue())
    counter += 1
    print(f"the counter is {counter} and the chunk has {len(df)} records")
With this code we download and upload small pieces of the data at a time, controlling the memory footprint through the chunksize.
3. Load the data and work with it as a DataFrame in Python in a Jupyter Notebook
The easiest way is to use SageMaker and its Jupyter Notebooks, where you can choose the machine sizes you need. However, we noticed that it is almost impossible to share notebook code within a team there, so we decided to use the free research and collaboration tool Google Colab, where we can exchange code and files; you also get up to 24 GB of GPU memory, which is sufficient for our data models and analysis.
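To talk to S3 from Colab, a small setup sketch could look like the following; the pip install line is only needed if boto3 is not already available in your runtime, and prompting for the keys avoids hard-coding credentials in a shared notebook:

# In a Colab cell: install boto3 if the runtime does not already have it
# !pip install boto3

import getpass

# Prompt for the S3 credentials instead of storing them in the notebook
ACCESS_ID = getpass.getpass('AWS access key id: ')
ACCESS_KEY = getpass.getpass('AWS secret access key: ')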
The code which we used to load the files is:
import io

import boto3
import pandas as pd

s3 = boto3.resource('s3',
                    aws_access_key_id=ACCESS_ID,
                    aws_secret_access_key=ACCESS_KEY)
bucket = s3.Bucket('your_bucket_name')

# Read every CSV chunk from the bucket and combine them into one DataFrame
frames = []
for obj in bucket.objects.all():
    body = obj.get()['Body'].read()
    frames.append(pd.read_csv(io.BytesIO(body)))
df = pd.concat(frames, ignore_index=True)
In that case you have all the files loaded into a single DataFrame and you are ready to do the magic.
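One optional tweak, in case the combined data gets close to the Colab runtime's memory: pandas can load only the columns you need and pin cheaper dtypes while reading each chunk. A small sketch reusing the loop above, with column names taken from our earlier query purely as an illustration:

# Inside the loop above: keep only the columns the model needs,
# parse the timestamp and use a compact dtype for the numeric id
frames.append(pd.read_csv(io.BytesIO(body),
                          usecols=['ts', 'site_id', 'uid'],
                          dtype={'site_id': 'int32'},
                          parse_dates=['ts']))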
We hope this post was helpful.
Summary: To sum up, we use SQL and Google Colab to download the data and transfer it to S3, and then Colab again to pull the data from S3 and run our models.
Conclusion: This is a stable solution for small teams with minimal costs and data up to 20 GB.
Good luck.