Hi and welcome to this demo on how to combine the powerful capabilities of IBM Db2 Warehouse with the flexibility of a Jupyter notebook. In this video, we will use PySpark to analyze streaming data coming from an IoT device, and see how to make that streaming data land in IBM Db2 Warehouse. This notebook is connected to IBM Db2 Warehouse using Livy, an open-source REST service for Apache Spark. You can use Livy with sparkmagic in a Jupyter notebook to execute Spark jobs easily. Refer to the README file for information about how to configure a Livy connection. A Livy connection lets you work in a Jupyter notebook that you open locally but that runs on a Spark cluster with IBM Db2 Warehouse. Let's go back to our notebook. Our use case is to stream data from a wind turbine, sent by an IoT device. Let's get started!

First, we'll set up the data producer. For this demo, we'll simulate an IoT device by using a container that generates sample data, as if from a wind turbine. All of the files and instructions are available in our GitHub repository. After building the Docker image, called IoTproducer, run the container with this command. You can also print out the generated data; here you can see the flow of generated data, second by second. For more information about how to set up this IoT producer container, refer to the documentation and our GitHub repository. Now, let's go back to our notebook.
We have our streaming data, which we are pretending comes from wind turbine sensors but which in fact comes from our little container, the IoT producer. The IoT producer container sets up an Apache Kafka server on a port and sends a constant flow of sensor data. Here in the Jupyter notebook, we will send that data to a Db2 table.
First things first, here are the imports. As we run the first cell to import the necessary PySpark classes, a Livy session is started in the background, and you can see that a Spark session is also starting. It is available as spark. In the second cell, we retrieve the Spark session created by Livy. Next, we check whether the Kafka assembly and Kafka SQL libraries are available and import them if they are not. In our case, both have been found, so we can go on; otherwise, we would have to get the libraries and restart the notebook. We are now ready to send streaming queries to access data from our IoT producer container. First, we subscribe to a topic called iot4dashdb, a property of the IoT producer that we set earlier. dsraw is a dataframe that collects the data flowing from the IoT device to the notebook. This is the typical raw Kafka format, with the value field containing the actual data in a nested format. We will extract the data in the next step.
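For orientation, here is a minimal sketch of what this subscription step typically looks like in PySpark. The broker address is a placeholder for this sketch, and the exact cells are in the GitHub repository.

```python
from pyspark.sql import SparkSession

# Retrieve the Spark session that Livy started for this notebook.
spark = SparkSession.builder.getOrCreate()

# Subscribe to the iot4dashdb topic published by the IoT producer container.
# The bootstrap server address below is a placeholder, not the notebook's actual value.
dsraw = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "iot-producer-host:9092")
              .option("subscribe", "iot4dashdb")
              .load())

# The raw Kafka schema: key, value, topic, partition, offset, timestamp, timestampType.
dsraw.printSchema()
```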
We define our other dataframe, ds, with a query that contains the values of dsraw as strings. We are now ready for data collection and exploration. So far, we have created some objects and done some setup, but we have not collected any data yet. Let's do it.
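Assuming the usual pattern for this step, the cell boils down to casting the binary Kafka value column to a string; the notebook's actual cell may select additional columns.

```python
# Cast the binary Kafka value column to a readable string.
ds = dsraw.selectExpr("CAST(value AS STRING) AS value")
ds.printSchema()
```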
Make sure the IoT producer is active before running the next cell. We define SQL dataframe streams on the basis of our two PySpark SQL dataframes, dsraw and ds, and call them rawQuery and turbineQuery, respectively. When you execute the cells, the queries begin to run, and they won't stop collecting data until you tell them to. That's why, after a few seconds, we stop them with the stop method. We can now take a look at the data we have just collected. We have indeed the columns specified in the schema we printed earlier. The values are not easily readable yet; we will pre-process them later. The topic is indeed iot4dashdb, and, as is inherent to streaming data, there is a timestamp column. Here we display a few rows from the table containing the values as strings. You can see that they are in a nested, JSON-like format. Let's define a proper schema so we can cleanly save our data into a Db2 table.

In this cell we use a while loop to collect a predetermined number of records from the data stream. The data is collected in batches. The size of a batch can vary because, although the total number of rows to be collected is set, the number of rows per batch is determined by the timeout variable. You can see the data corresponding to each batch here. Now, it is time to see how to write this data into a Db2 table. This cell collects some more columns of fresh data and parses the row data to extract the values from their nested format, batch by batch. We print the first records for you.

To conclude this notebook, let's visualize our time series. In this demo, we do it on the client side, not on the server side. Look at the local magic at the beginning of this cell: it ensures that plots are output locally. Additionally, the matplotlib inline magic ensures that figures are not plotted in a separate window but embedded directly in this notebook. To plot our data locally, we bridge remote and local with an SQL query that creates a view of our data stored in Db2, and we obtain a pandas DataFrame that we use for visualization. Here we have just the points corresponding to a specific sensor, and in this last cell we have plotted both the temperature and the power production of a turbine over time.

Visit our GitHub repository to download this notebook and play around with it. You will also find other useful demos and tutorials for IBM Db2 Warehouse! Deploy your application easily and run it anytime on IBM Db2! Thanks for watching!
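As a recap of the streaming and plotting steps walked through above, here is a minimal, hedged sketch. The query names rawQuery and turbineQuery come from the transcript; the in-memory sink, the sleep interval, and the table names are assumptions for illustration rather than the notebook's exact code.

```python
import time

# Start two streaming queries backed by in-memory tables so the batches can be inspected with SQL.
rawQuery = (dsraw.writeStream
                 .queryName("qraw")
                 .format("memory")
                 .start())

turbineQuery = (ds.writeStream
                  .queryName("qturbine")
                  .format("memory")
                  .start())

# Let the queries collect data for a few seconds, then stop them explicitly.
time.sleep(10)
rawQuery.stop()
turbineQuery.stop()

# Inspect the collected rows: raw Kafka columns first, then the values cast to strings.
spark.sql("SELECT * FROM qraw").show(5, truncate=False)
spark.sql("SELECT * FROM qturbine").show(5, truncate=False)
```

And here is a sketch of the client-side plotting step, assuming the data was brought back to the local kernel as a pandas DataFrame named turbine_df (for example with sparkmagic's -o option) and that it has sensor_id, timestamp, temperature, and power columns; these names are illustrative, not the notebook's actual ones.

```python
%%local
%matplotlib inline
import matplotlib.pyplot as plt

# Keep only the points for one specific sensor, then plot temperature and power over time.
sensor_df = turbine_df[turbine_df["sensor_id"] == "turbine_1"]
sensor_df.plot(x="timestamp", y=["temperature", "power"], subplots=True, figsize=(10, 6))
plt.show()
```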