machine_learning aws, deeplens

I recently bought this device in anticipation of getting acquainted with AWS Machine Learning services. I thought building a DIY project using ML would be a nice and fun introduction. This weekend I had a chance to play around with it.

Unboxing

Registering

The first thing you need to do is register your DeepLens, which can be done here

Alternatively, you can log in to the AWS Console, go to the AWS DeepLens service and click Register Device. As of this writing, this service is only available in Tokyo, N. Virginia and Frankfurt. As Frankfurt is the closest to me, I chose that one.

Creating a Project

Now that the device is ready, I went to the Projects section and created a new project based on a template. For this initial exploration mission I chose the Bird Classification template.

When you choose your template, it shows a description of the template and how it works:

Once you create the project, it creates the necessary resources. For example, my template created 2 Lambda functions:

Now we are ready to deploy the project to the device, which we can do by simply clicking the Deploy to Device button in the top right-hand corner of the project page:

After that, you choose your DeepLens from the available device list (in my case I had one device) and click Review, then Deploy.

Testing the project

First I followed the steps to install the custom certificate so that I could start viewing the video feed in my browser:

Because DeepLens doesn’t know whether there’s a bird in the picture or not, it constantly makes a guess. Even an empty carton box looks like a Long-tailed Jaeger with 2.1% confidence, apparently!

When I presented an image I found on the Internet, the results improved significantly though (thankfully, otherwise this would all be for nothing!)

According to Google, the bird in the picture is indeed a Kingfisher.

The next test is a bird I saw on the roof and identified as a Mallard. Let’s see what DeepLens thinks about it:

So with 17.5% confidence, DeepLens agrees that it is a Mallard. One thing to note is that the angle of the image significantly changes the prediction: even a slight change in the angle makes the confidence shift from 3% to 45%. But it’s just a test project template anyway, so I’m not looking for too much accuracy at this point.

Final test: I downloaded the dataset used to train the algorithm. It’s a 1.1GB download full of images. To compare with the last Mallard test, I got one picture from the Mallard folder and showed it to DeepLens:

As shown above, the confidence increased to 55%. So if I used the pictures I took to train the algorithm, I would probably be able to get much better results.

Conclusion

This was just a first step to set up and explore the device. Hopefully in the coming days I’ll be doing more meaningful projects with it and will post about those too.


docker registry, self_hosted

Working with Docker is great, but when you want to deploy your applications to another server you need a registry to push your images to, so that you can pull them from the other end. To have that capability in my dev environment, I decided to set up my own self-hosted Docker registry.

For the sake of brevity, I will omit creating the Raspberry Pi SD card and installing Docker on it. There are lots of great videos and articles out there already.

Self-hosted vs Hosted

When it comes to hosted Docker registries, there are lots of free and paid options.

Benefits of self-hosted registry:

  • See the storage size and number of repos your system requires early on, without having to pay anything
  • Push/pull images on the go without an Internet connection during the development phase
  • No privacy concerns: if you upload an image with your application in it, the image may contain sensitive data, which can pose a risk if a 3rd-party registry has full access to it
  • Free!

Benefits of hosted registry:

  • Hassle-free: No backups or server management

There are great hosted Docker registries such as Docker Hub and Amazon ECR. I wouldn’t recommend using a self-hosted registry for production, but if price or privacy is a concern it can certainly be an option.

Creating Self-Hosted Registry

It sounds like it requires installing a server application, but the nice thing about Docker is that, even though it is a Docker registry, it can run in a container itself. So first off we pull the registry image from Docker Hub:

docker pull registry

Now let’s create a container that will act as our registry:

docker run -d -p 5000:5000 --restart always --name registry registry
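We can confirm the container is up and listening on port 5000:

docker ps --filter name=registry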

In my case the hostname of the Raspberry Pi is HOBBITON, so the registry can be reached at HOBBITON.local:5000.

Now, to test how we can push and pull images, let’s download Docker’s hello-world image from Docker Hub:

docker pull hello-world

Now, to push this into our own registry running on the Raspberry Pi, all we have to do is tag it with the server URL, such as:

docker tag hello-world HOBBITON.local:5000/hello-world

At this point, if we take a look at the images on our local machine, we can see the hello-world image is duplicated.
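Listing the images shows both tags pointing at the same image ID:

docker images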

Now let’s push it to the Pi:

docker push HOBBITON.local:5000/hello-world

This doesn’t work, for the following reason:

This is because the registry is considered insecure, and by default it’s rejected by the client. We can confirm it’s deemed insecure by running the following command:

docker info

At the bottom of the output we can see that only the localhost registry is listed as insecure:

To address this, we can add our registry to the list of insecure registries. For example, on a Mac client we go to Preferences -> Daemon and add the Raspberry Pi registry as shown below:
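On clients configured through a daemon.json file (for example on Linux), the equivalent change is an insecure-registries entry; Docker Desktop’s Daemon preferences edit the same setting:

{
  "insecure-registries": ["HOBBITON.local:5000"]
}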

After this, if we try once again to push to the Pi, we can see it succeeds:

Now if we check the repository list on the registry again we can see the hello-world image hosted on our Pi:
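One way to check is the registry’s v2 HTTP API, whose catalog endpoint lists the hosted repositories:

curl http://HOBBITON.local:5000/v2/_catalog

This should return a small JSON document with hello-world in the repositories list.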

Let’s now see if we can pull this image from another client.
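On the other client (which also needs HOBBITON.local:5000 in its insecure registries list), the pull uses the same tagged name:

docker pull HOBBITON.local:5000/hello-world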

And after pulling the image we can see it in the image list:


dev csharp, elasticsearch, docker, nest

I’ve been playing around with Elasticsearch on several occasions. This post is to organize those thoughts and experiences and show an easy way to set up Elasticsearch and start playing around with it.

Setup

The easiest way to set up Elasticsearch locally is to use Docker. As of this writing the latest version of Elasticsearch is 7.2.0, and I’ll be using that in this example.

If you don’t already have the image, simply pull it from Docker Hub:

docker pull docker.elastic.co/elasticsearch/elasticsearch:7.2.0

For a development environment, the suggested command to run a container is

docker run -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:7.2.0

which keeps things very simple and straightforward. But in my workout I’d like to insert a whole bunch of data and run some queries on it, and I don’t want to regenerate my data over and over again. So I decided to persist my data on the host.
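Either way, once a container is up, we can verify Elasticsearch is responding:

curl http://localhost:9200

which returns a small JSON document with the cluster name and version.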

Persisting Elasticsearch Data

Instead of running containers one by one on the command line, a better approach is to create a docker-compose.yml file and use Docker Compose to start the services. I used the sample YAML file provided in the official Elastic documentation:

version: '2.2'
services:
  es01:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.2.0
    container_name: es01
    environment:
      - node.name=es01
      - discovery.seed_hosts=es02
      - cluster.initial_master_nodes=es01,es02
      - cluster.name=docker-cluster
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
    ulimits:
      memlock:
        soft: -1
        hard: -1
    volumes:
      - esdata01:/usr/share/elasticsearch/data
    ports:
      - 9200:9200
    networks:
      - esnet
  es02:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.2.0
    container_name: es02
    environment:
      - node.name=es02
      - discovery.seed_hosts=es01
      - cluster.initial_master_nodes=es01,es02
      - cluster.name=docker-cluster
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
    ulimits:
      memlock:
        soft: -1
        hard: -1
    volumes:
      - esdata02:/usr/share/elasticsearch/data
    networks:
      - esnet

volumes:
  esdata01:
    driver: local
  esdata02:
    driver: local

networks:
  esnet:

This example creates an Elasticsearch cluster with 2 nodes and uses named volumes to persist the data, so the next time we bring this cluster up we should be able to continue where we left off data-wise.
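Assuming this is saved as docker-compose.yml, the cluster can be brought up in the background and checked with the standard cluster health endpoint:

docker-compose up -d
curl "http://localhost:9200/_cluster/health?pretty"

The response should report 2 nodes once both containers have joined the cluster.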

Sample Application

In my previous blog post I developed a simple test data generator to generate fake bank statement data with a library called Bogus. In this project, I will use that generator to generate lots and lots of test data, insert them into Elasticsearch and have fun with it!

When you start a C# project and look for a library to interact with Elasticsearch, it’s a bit confusing to find out there are actually two of them: Elasticsearch.Net and NEST. The gist of it is that NEST is a high-level library that uses Elasticsearch.Net under the hood. It also exposes the low-level client, so it actually enhances Elasticsearch.Net and allows using strongly typed DSL queries. In the sample application I used NEST.
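As a quick sketch of what that means (my own example, using the elasticClient created in the next section): the low-level client is reachable through the LowLevel property of the NEST client, so you can drop down to raw JSON queries when needed. StringResponse and PostData come from the Elasticsearch.Net namespace.

// Raw JSON search through the low-level client exposed by NEST
var lowLevelResponse = elasticClient.LowLevel.Search<StringResponse>(
    "bankstatementindex",
    PostData.Serializable(new { query = new { match_all = new { } } }));
Console.WriteLine(lowLevelResponse.Body);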

Creating Elasticsearch client

Creating a client with some basic settings is straightforward:

using (var connectionSettings = new ConnectionSettings(new Uri("http://localhost:9200")))
{
    var settings = connectionSettings
        .DefaultIndex("bankstatementindex")
        .ThrowExceptions(true);
    IElasticClient elasticClient = new ElasticClient(settings);
}

Indexing data

To index a single document, the IndexDocument method can be called. However, using this method in a loop over a large number of documents is not recommended:

elasticClient.IndexDocument<BankStatementLine>(testData.First());

For multiple documents, the IndexMany method should be called. If the data size is too large, then using the BulkAll method and the BulkAllObservable helper is recommended.
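For completeness, here’s a minimal IndexMany sketch (my own, using the elasticClient and testData from the surrounding samples) that also checks the response for per-document failures:

// Index the whole collection in a single bulk request;
// for very large collections prefer BulkAll as described below
var indexManyResponse = elasticClient.IndexMany(testData);
if (indexManyResponse.Errors)
{
    // ItemsWithErrors enumerates the documents that failed to index
    foreach (var itemWithError in indexManyResponse.ItemsWithErrors)
    {
        Console.WriteLine($"Failed to index document {itemWithError.Id}: {itemWithError.Error}");
    }
}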

To see the difference, I created a test that indexes 5,000 documents by looping over the array, and then does the same using BulkAll. Looping over the collection took around 26 seconds, whereas the bulk index took only 1.2 seconds, as shown in the screenshot.

It also displays “Done” 5 times: because I set the size to 1,000 and requested 5,000 documents to be indexed, it automatically divided the load into 5 batches and made 5 calls:

var bulkAll = elasticClient.BulkAll(testData, x => x
                .BackOffRetries(2)
                .BackOffTime("30s")
                .RefreshOnCompleted(true)
                .MaxDegreeOfParallelism(4)
                .Size(1000));

bulkAll.Wait(TimeSpan.FromSeconds(60),
    onNext: (b) => { Console.Write("Done"); }
);

The same result can also be achieved by subscribing to the BulkAll observable:

var waitHandle = new CountdownEvent(1);

bulkAll.Subscribe(new BulkAllObserver(
    onNext: (b) => { Console.Write("."); },
    onError: (e) => { throw e; },
    onCompleted: () => waitHandle.Signal()
));

waitHandle.Wait();

Showing progress

The sample code below shows how to display progress using the onNext action delegate:

var testData = dataGen.Generate(statementConfig.StartDate, statementConfig.EndDate, statementConfig.OpeningBalance, statementConfig.DebitTransactionRatio, statementConfig.TransactionDateInterval, statementConfig.NumberOfStatementLines);
var cancellationToken = new CancellationToken();
var batchSize = 250;
var bulkAll = elasticClient.BulkAll(testData, x => x
    .BackOffRetries(2)
    .BackOffTime("30s")
    .RefreshOnCompleted(true)
    .MaxDegreeOfParallelism(4)
    .Size(batchSize), cancellationToken);
var totalIndexed = 0;
var stopWatch = new Stopwatch();
stopWatch.Start();
bulkAll.Wait(TimeSpan.FromSeconds(60),
    onNext: (b) =>
    {
        totalIndexed += batchSize;
        Console.WriteLine($"Total indexed documents: {totalIndexed}");
    }
);

and the output looked like this:

Even though the numbers seem a bit wonky, I think it’s a good example to illustrate the multi-threaded nature of BulkAll: I set the maximum degree of parallelism to 4, and the first 1,000 documents were indexed in a mixed order, suggesting that the batches were running in parallel.

Cancellation with bulk operations

The BulkAll observable can also be cancelled for longer-running processes if necessary. The code excerpt below shows the pieces relevant to cancellation:

var cancellationTokenSource = new CancellationTokenSource();
var cancellationToken = cancellationTokenSource.Token;
var batchSize = 250;
var bulkAll = elasticClient.BulkAll(testData, x => x
    .BackOffRetries(2)
    .BackOffTime("30s")
    .RefreshOnCompleted(true)
    .MaxDegreeOfParallelism(4)
    .Size(batchSize), cancellationToken);
var totalIndexed = 0;
var stopWatch = new Stopwatch();
stopWatch.Start();
Task.Factory.StartNew(() =>
    {
        Console.WriteLine("Started monitor thread");
        var cancelled = false;
        while (!cancelled)
        {
            if (stopWatch.Elapsed >= TimeSpan.FromSeconds(60))
            {
                if (cancellationToken.CanBeCanceled)
                {
                    Console.WriteLine($"Cancelling. Elapsed time: {stopWatch.Elapsed.ToString("mm\\:ss\\.ff")}");
                    cancellationTokenSource.Cancel();
                    cancelled = true;
                }
            }

            Thread.Sleep(100);
        }
    }
);

try
{
    bulkAll.Wait(TimeSpan.FromSeconds(60),
        onNext: (b) =>
        {
            totalIndexed += batchSize;
            Console.WriteLine($"Total indexed documents: {totalIndexed}");
        }
    );
}
catch (OperationCanceledException)
{
    Console.WriteLine("Taking longer than allowed. Cancelled.");
}

Querying Data

Querying data can be done by calling the Search method of ElasticClient. Here are a few examples; there are more in the accompanying sample source code:

// Get the first 100 documents
var searchResponse = elasticClient.Search<BankStatementLine>(s => s
    .Query(q => q
        .MatchAll()
    )
    .Size(100)
);

// Get transactions with dates between 2018-01-01 and 2018-01-10
var dateRangeResponse = elasticClient.Search<BankStatementLine>(s => s
    .Query(q => q
        .DateRange(x => x
            .Field(f => f.TransactionDate)
            .GreaterThanOrEquals(new DateTime(2018, 01, 01))
            .LessThanOrEquals(new DateTime(2018, 01, 10))
        )
    )
    .Size(10000)
);

Deleting data

For my tests I had to delete all documents frequently, which can be achieved by running the query below:

elasticClient.DeleteByQuery<BankStatementLine>(del => del
    .Query(q => q.QueryString(qs => qs.Query("*")))
);
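To verify the index is empty afterwards, a count query can be used. This is a sketch assuming the bankstatementindex default index from the client setup above; the explicit refresh makes the deletions visible to the count:

// Refresh the index so the deletions are visible to searches
elasticClient.Indices.Refresh("bankstatementindex");

var countResponse = elasticClient.Count<BankStatementLine>(c => c
    .Query(q => q.MatchAll())
);
Console.WriteLine($"Documents remaining: {countResponse.Count}");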

Source Code

The sample application can be found under the blog/ElasticsearchWorkout folder in the repository.
