Visuomotor robot arm coordination : Step 2 – Build the model

In the previous step I collected data using my robot arm and the webcams I installed on it. The data contains the detected object coordinates on the left and right images, the robot arm joint angles at the moment the object was detected, and the same joint angles once the arm has been manually placed on the object.

So, for each detected object (i.e. database record), I recorded these values :

  • Position of object on the left image (xL, yL)
  • Position of object on the right image (xR, yR)
  • Arm joint angles where the object has been seen : (alphaI, betaI, thetaI, gammaI)
  • Arm joint angles at the object location : (alphaT, betaT, thetaT, gammaT)

Now the goal is to build a model that will predict the arm joint angles that will place the arm at the location of the object. To make this prediction, the model inputs will be the object position on each image and the current arm joint angles.
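
To make the inputs and outputs concrete, here is a minimal sketch of how one record could be split into a model input vector and a target vector. The key names reuse the variables listed above, but the dictionary layout and the numeric values are assumptions made up for illustration, not the actual database schema.

```python
# Key names follow the variables listed in the post; the dict layout is an assumption.
INPUT_KEYS = ["xL", "yL", "xR", "yR", "alphaI", "betaI", "thetaI", "gammaI"]
TARGET_KEYS = ["alphaT", "betaT", "thetaT", "gammaT"]


def record_to_sample(record):
    """Split one database record into (inputs, targets) for the model."""
    inputs = [float(record[key]) for key in INPUT_KEYS]
    targets = [float(record[key]) for key in TARGET_KEYS]
    return inputs, targets


# Made-up values, for illustration only.
example_record = {
    "xL": 212, "yL": 148, "xR": 190, "yR": 151,
    "alphaI": 10, "betaI": 45, "thetaI": 30, "gammaI": 0,
    "alphaT": 12, "betaT": 60, "thetaT": 25, "gammaT": 5,
}

inputs, targets = record_to_sample(example_record)
print(len(inputs), len(targets))  # prints "8 4"
```

The model therefore maps 8 input values to 4 output values.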

I started by analyzing the data I gathered to determine its properties. I quickly found that the relationship in the data is not linear, so I need a non linear model to map it. I’m used to neural networks, which are powerful non linear models, so I decided to implement a framework that will help me make the model learn the data.

The learning process

To ease the learning process using the DeMIMOI models, I decided to proceed in 4 steps :

  1. Prepare and preprocess the data to be learnt
    The idea is to build a specific dataset out of the raw data. For example, this is where I would apply a mathematical operation to the raw data, say a subtraction or even a sine.
    The resulting data is stored in a temporary storage (the PC RAM in my case).
  2. Teach the model
    The learning algorithm uses the temporary storage to feed the model with the data to learn.
  3. Plot the results of the obtained model
    At this step, you want to see what the model learnt, so you just plot the output of the model and the desired output together to see the error.
  4. Save the model
    Once the model fulfills the needs, it can be saved for later use.

Learning process implementation

For the first three steps, I created three DeMIMOI_Collection objects, each holding the required elements of its step.
Now I’m going to show you each of them and describe what’s going on.

  1. Data preparation and preprocessing
    Visuo Motor Coordination - Data Preprocessing
    This step requires raw data. The data comes from the Long Term Memory block, which hides a MongoDB database. The raw data flowing out of this block is then directly fed to the STM (Short Term Memory) block. In this particular case, I could have sent the raw data directly to the model, but at first I needed to make some calculations beforehand, which were later removed…
  2. Teaching the model
    Visuo Motor Coordination - Learning Step
    This step is the learning step. The data created at the previous step is fed to the Neural Network learning algorithm. Before sending the data to this block, we must scale it, since neural networks typically work best with data normalized to [0, 1] or [-1, 1], depending on the activation function. This is why I put two normalization blocks, one on the inputs and the other one on the outputs.
    In this picture, we can clearly see the inputs and the desired outputs we want the model to learn.
  3. Plotting the results
    Visuo Motor Coordination - Data Plotting
    Once the model is built and ready to operate, I want to see the results of the learning step. I reuse the STM data as input and the data normalizer to ensure data normalization. The neural network then gets normalized data, and its output is in turn denormalized back to the original output units. Finally this data is plotted by the DeMIMOI_Chart0 that is mapped to the Windows Forms Chart control I placed on the form.
    This results in this interface :
    Visuo Motor Coordination - Application Learning Step
    We can see that for each graph, the neural network outputs are really close to the desired outputs, which means that it converged to a « good » solution.
    The left textbox contains Graphviz code to draw the system structure I showed all along this article. The right one contains the Graphviz neural network structure.
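
The normalization and « normalize back » blocks used in the learning and plotting steps can be sketched as a simple min-max scaler. This is only an illustration under my own assumptions: the class and method names below are hypothetical and are not part of DeMIMOI, and the target range [-1, 1] is just one of the two ranges mentioned above.

```python
class MinMaxScaler:
    """Scale each column of a dataset to [low, high] and back (hypothetical helper)."""

    def __init__(self, low=-1.0, high=1.0):
        self.low = low
        self.high = high
        self.mins = []
        self.maxs = []

    def fit(self, rows):
        """Remember the minimum and maximum of every column."""
        columns = list(zip(*rows))
        self.mins = [min(col) for col in columns]
        self.maxs = [max(col) for col in columns]
        return self

    def transform(self, row):
        """Map raw values to the normalized range."""
        out = []
        for value, mn, mx in zip(row, self.mins, self.maxs):
            if mx == mn:
                # Constant column: place it in the middle of the range.
                out.append((self.low + self.high) / 2.0)
            else:
                out.append(self.low + (value - mn) * (self.high - self.low) / (mx - mn))
        return out

    def inverse(self, row):
        """Map normalized values back to the original units."""
        out = []
        for value, mn, mx in zip(row, self.mins, self.maxs):
            if mx == mn:
                out.append(mn)
            else:
                out.append(mn + (value - self.low) * (mx - mn) / (self.high - self.low))
        return out


# Made-up joint angle targets, for illustration only.
targets = [[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]]
scaler = MinMaxScaler().fit(targets)
normalized = scaler.transform([5.0, 20.0])
restored = scaler.inverse(scaler.transform([10.0, 30.0]))
print(normalized, restored)
```

The same fitted scaler has to be reused at prediction time, otherwise the network outputs cannot be converted back to real joint angles.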

Now, it seems that we managed to get a model that is able to predict the arm joint angles that place the end effector on the object that has been seen. So now comes the exciting part : let’s build the application that will demonstrate this ability in real time ! It will be the subject of my next post !


Visuomotor robot arm coordination : Step 1 – Stereovision object recognition

Ok, so I want to build a system that is able to pick up an object it has seen. I called that project Visuomotor coordination because I want to use the robot webcams to estimate the movement that the robot has to make to reach the object seen. If you read this Wikipedia article, you’ll get what I mean.

So the first step is to detect the object. I assume that the system knows what the object to pick up is, and that it has the required data on this object to be able to find it in the scene. As a vision system, I’ll use the two webcams I have on my robot arm. The idea of using both of them is to benefit from stereo vision to extract some depth information and determine how far the object is from the robot arm gripper.
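
As a side note on why two views carry depth information : for an idealized pair of rectified, horizontally aligned cameras, the depth of a point is the focal length times the baseline divided by the disparity (the horizontal shift of the point between the two images). The sketch below only illustrates that textbook relation. The focal length and baseline values are invented, and this project does not compute depth explicitly.

```python
def depth_from_disparity(x_left, x_right, focal_px, baseline_m):
    """Depth (in meters) of a point seen at x_left and x_right (in pixels).

    Assumes rectified, horizontally aligned cameras; focal_px is the focal
    length in pixels and baseline_m the distance between the two cameras.
    """
    disparity = x_left - x_right
    if disparity <= 0:
        raise ValueError("disparity must be positive for a point in front of the cameras")
    return focal_px * baseline_m / disparity


# Invented camera parameters, for illustration only.
z = depth_from_disparity(x_left=320, x_right=300, focal_px=700, baseline_m=0.06)
print(f"{z:.2f} m")  # prints "2.10 m"
```

The closer the object, the larger the disparity, which is the information the two object positions give to the rest of the system.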

Object detection – Algorithms

There are numerous ways to do object detection… There are a lot of detection/recognition algorithms out there ! You can take a look at that OpenCV tutorial to have a better idea of what they are and their features.

I made some tests, read some documents and finally chose the SURF algorithm because it shows pretty good detection performance and also because it’s possible to make it faster by running it on the GPU instead of the CPU. On the cons side, one should note that it’s a patented, non free algorithm, except for research purposes and personal use. So no worries for my application, but it’s worth keeping in mind…

Stereo object detection : my idea

My idea was the following : use the SURF algorithm by starting from the tutorial sample demonstrating Feature Matching. This sample shows how to detect an object by matching it with a reference image of that same object. This will provide me with a way to detect the object in the scene using one of the two webcams (say the left one). Then my idea is to use that same algorithm to find the corresponding features in the second webcam image.
This way, I’ll obtain two point clouds (one for each webcam) that are mapped to the object I want to detect. Then by calculating the mean of each point cloud, I’ll get the center of the object seen on each image. So, in the end I have the object position on both images.
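
The « mean of each point cloud » step is simple enough to sketch. The function name and the sample points below are mine, chosen for illustration.

```python
def centroid(points):
    """Return the mean (x, y) of a list of 2D points."""
    if not points:
        raise ValueError("cannot compute the centroid of an empty point cloud")
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    return (mean_x, mean_y)


# Four made-up matched feature positions forming a square.
left_cloud = [(0, 0), (2, 0), (2, 2), (0, 2)]
print(centroid(left_cloud))  # prints "(1.0, 1.0)"
```

Applying it to the left and right point clouds gives the (xL, yL) and (xR, yR) object positions.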

Let’s summarize the stereo object detection algorithm I’ll program :

  1. The inputs :
    1. Two webcam images (left webcam, right webcam)
    2. A reference image of the object to be found
  2. The function block :
    1. Extract SURF features from the reference image
    2. Extract SURF features from the left webcam image
    3. Match the features to find where the object is on the left webcam image
    4. Extract SURF features from the right webcam image
    5. Match the features found on step 3 to find where the object is on the right image
  3. The outputs :
    1. Point cloud of the object on the left image
    2. Point cloud of the object on the right image
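
The matching steps above can be sketched in a few lines. To stay self-contained, this sketch does not use SURF or EmguCV : descriptors are plain tuples of numbers, and matching is a brute-force nearest-neighbour search with a ratio test, which is a common way to filter matches but only an assumption about what the real application does. All names and values are hypothetical.

```python
import math


def match_descriptors(query_desc, scene_desc, ratio=0.75):
    """Return (query_index, scene_index) pairs that pass the ratio test."""
    matches = []
    if len(scene_desc) < 2:
        return matches
    for i, descriptor in enumerate(query_desc):
        # Sort scene descriptors by distance to the current query descriptor.
        distances = sorted((math.dist(descriptor, s), j) for j, s in enumerate(scene_desc))
        (best, best_index), (second, _) = distances[0], distances[1]
        # Keep the match only if it is clearly better than the runner-up.
        if best < ratio * second:
            matches.append((i, best_index))
    return matches


# Toy 2D "descriptors" and keypoint positions, made up for illustration.
reference_desc = [(0.0, 0.0), (10.0, 10.0)]
left_desc = [(0.1, 0.0), (10.0, 10.2), (50.0, 50.0)]
left_points = [(100, 120), (130, 125), (300, 10)]
right_desc = [(10.1, 10.1), (0.0, 0.2), (80.0, 80.0)]
right_points = [(110, 126), (80, 121), (20, 200)]

# Step 3: find the object in the left image.
left_matches = match_descriptors(reference_desc, left_desc)
left_cloud = [left_points[j] for _, j in left_matches]

# Step 5: look for those same left features in the right image.
matched_left_desc = [left_desc[j] for _, j in left_matches]
right_matches = match_descriptors(matched_left_desc, right_desc)
right_cloud = [right_points[j] for _, j in right_matches]

print(left_cloud, right_cloud)
```

The two lists printed at the end play the role of the two output point clouds.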

After spending quite some time understanding the EmguCV functions and implementing some of my own, I finally came up with a stable algorithm that is able to detect an object on both webcams and output the point clouds.

The results :

To develop this idea I created a Visual Studio project in which I experimented and put together all the blocks I needed to reach my goal. I ended up with an application that shows the two webcam images with little circles and lines that clearly show what has been found.

Here is a screenshot of the application detecting the object.

Stereo vision feature extraction application. Red points are instantaneous detected points, the blue ones are the average points found from the instantaneous points.

What you can see here is an image formed by concatenating the left and right webcam images. The object is a small wooden cube on which I drew lines to make it easier to detect. The blue points are the averaged points on each view, calculated from the instantaneous detected points. The red ones are the instantaneous detected features that belong to the reference image as well as to the left and right images from the webcams.

The buttons are :

  • Take snapshot : takes a snapshot of the object, which is then used as the reference image. I also manually create a mask to tell the algorithm exactly what my object is in the scene.
  • Reset object position : this button resets the average position to restart from the newly detected points
  • Remember object position : this stores the object position as well as the robot arm position at which the object is being seen
  • Remember arm target position : it stores the object position (X and Y for each image), the robot arm position at which the object has been detected and the robot arm position when it’s on the object in a MongoDB database. For this step, once the object has been detected, I manually move the arm to be on the object, ready to grip it. In a sense, I show it what it should do.
  • Pause/resume object detection : as its name says, it pauses or resumes the object detection loop
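
One possible reading of the blue « average » points and of the Reset object position button is a running average that accumulates the instantaneous positions and can be restarted. The sketch below follows that reading; the class is hypothetical and only illustrates the idea.

```python
class RunningAverage:
    """Incremental mean of 2D positions, with a reset (hypothetical helper)."""

    def __init__(self):
        self.reset()

    def reset(self):
        """Forget everything and restart from the next detected point."""
        self.count = 0
        self.x = 0.0
        self.y = 0.0

    def update(self, point):
        """Add one instantaneous position and return the current average."""
        self.count += 1
        self.x += (point[0] - self.x) / self.count
        self.y += (point[1] - self.y) / self.count
        return (self.x, self.y)


average = RunningAverage()
average.update((10, 20))
print(average.update((20, 40)))  # prints "(15.0, 30.0)"

average.reset()                   # what the reset button would do
print(average.update((100, 50)))  # prints "(100.0, 50.0)"
```

Averaging over several frames smooths out the jitter of individual detections, at the cost of reacting slowly when the object moves, hence the need for a reset.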

Using this application and process, I obtained a database that is composed of many records like this one :

This is a view of one record created by the application

Now that I have that database, I can start to train a model that will calculate the arm final position thanks to the starting position and detected object position ! Now, this is really going to be interesting as I’ll see what the model is able to do ! Reach the object or not ?

In a nutshell

The idea is to make my robot arm able to reach an object it has seen with its webcams. I made an application that extracts features from a reference image and detects these same features in the left and right webcam images. This allows me to get the position of the object in each image. Then I built a database containing the object position on the images, the robot arm position when the object has been detected and the arm position when it’s on the object.

My next step will be to create a model that calculates the arm position that will put the arm on the object, using the current robot arm position and the detected object position on the webcam images.

This will be the object of my next post, so stay tuned !

Stereo vision color calibration

    Last time, I noticed that my webcams didn’t have the same color response. One gives a clear, white image, the other is red tinted…
This difference in color is not desirable when trying to apply stereo vision algorithms such as disparity mapping. These algorithms need two images with the same appearance for the matching to work best.

    I read an article from Afshin Sepehri called Color Calibration of Stereo Camera that describes in detail a technique to deal with this problem.

About the article

    The article details a technique that corrects the stereo image pair by applying a mathematical function to each pixel to change its color to a calibrated one, which is the true color.

    He uses random test patterns composed of colored squares which are then printed to be viewed by the webcams to calibrate.

Afshin’s example test pattern

    Then, he has the true color, left viewed color and right viewed color. He applies a minimization algorithm to find the mathematical function to apply on the left and right image pixels to calibrate each image to obtain true and identical colors.

For the minimization process he has two approaches :

    1. Assuming that each true color component only depends on the same component : Rc = f(Rf) , Gc = f(Gf) , Bc = f(Bf)
    2. Assuming that each true color component depends on all three components : Rc = f1(Rf, Gf, Bf) , Gc = f2(Rf, Gf, Bf) , Bc = f3(Rf, Gf, Bf)
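
To make the difference between the two approaches concrete, here is a small sketch. It assumes the functions are linear (a gain and an offset per channel for the first approach, a 3x3 matrix plus offsets for the second). That is my simplifying assumption, and the coefficients are invented; the article may well use other function shapes.

```python
def clamp(value):
    """Round and limit a color component to the 0..255 range."""
    return max(0, min(255, int(round(value))))


def correct_per_channel(pixel, gains, offsets):
    """Approach 1: each output component depends on the same input component only."""
    return tuple(clamp(g * c + o) for c, g, o in zip(pixel, gains, offsets))


def correct_cross_channel(pixel, matrix, offsets):
    """Approach 2: each output component depends on all three input components."""
    return tuple(
        clamp(sum(m * c for m, c in zip(row, pixel)) + o)
        for row, o in zip(matrix, offsets)
    )


# Invented coefficients, for illustration only.
pixel = (200, 100, 50)  # (R, G, B) as seen by the webcam
print(correct_per_channel(pixel, gains=(0.9, 1.0, 1.1), offsets=(0, 0, 5)))

matrix = [
    [0.9, 0.05, 0.0],  # corrected R mixes a bit of G
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
print(correct_cross_channel(pixel, matrix, offsets=(0, 0, 0)))
```

The second form has more parameters to estimate, but it can compensate for a tint in one channel that depends on the other channels.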

    After testing both, he concludes that the best one is the second one. He shows his results and it seems to be quite interesting !

    I was quite frustrated by the fact that the article does not show the outcome of this technique on the disparity map output to see if it’s really relevant or not… That’s why I decided to give this a try and implement my own color calibration algorithm based on Afshin’s article.

My implementation idea

    My approach is mainly the same as Afshin’s, except that I decided to build a relative correcting algorithm whereas Afshin’s algorithm is an absolute one.
He tries to obtain the true colors on his final images, which raises some issues :

    • After printing the test patterns, the printed colors may be off because of the printer’s own color calibration
    • Scene lighting and light reflections may change the webcam perception of the color and introduce errors
    • There are two models, one for each image, so it may require high computing power to run it online in real time

My idea is based on these points :

    • One of the two webcam images is considered the reference, and the calibration algorithm has to correct the other image to be as close as possible to the reference
    • The algorithm does not give the true colors but we have both image colors calibrated

Implementation steps

On the programming part, I had the following points to code :

    • Random test chessboard pattern generator : generate one or more random test patterns as image files to be printed
    • Test pattern finder : algorithm to find and extract the chessboard from an image
    • Color pattern extraction : locate the colored squares and average the pixels each one contains. It gives a list of the colors of each square.
    • Minimization process : algorithm to build the model by finding a function to transform the old (uncalibrated) pixel colors to new calibrated pixel colors. It gives a transformation matrix that can then be saved to a file and loaded in another program, the same way we do with extrinsics and intrinsics (see previous article)
    • Image transformation algorithm : feed an image to the input and get the calibrated image on the output
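
Here is a rough sketch of the color extraction and minimization steps in their simplest form : average the pixels of each square, then fit one gain and one offset per channel by least squares so that the colors of the image to correct match those of the reference image. This corresponds to the per-component approach with a linear function, which is an assumption on my side; it is not necessarily the model used in the application, and all the values are made up.

```python
def mean_color(pixels):
    """Average (R, G, B) of the pixels inside one colored square."""
    n = len(pixels)
    return tuple(sum(p[ch] for p in pixels) / n for ch in range(3))


def fit_gain_offset(observed, reference):
    """Least-squares fit of reference ~ gain * observed + offset for one channel."""
    n = len(observed)
    mean_obs = sum(observed) / n
    mean_ref = sum(reference) / n
    variance = sum((x - mean_obs) ** 2 for x in observed)
    if variance == 0:
        raise ValueError("observed values must not all be identical")
    covariance = sum((x - mean_obs) * (y - mean_ref) for x, y in zip(observed, reference))
    gain = covariance / variance
    return gain, mean_ref - gain * mean_obs


def fit_per_channel(observed_colors, reference_colors):
    """Fit one (gain, offset) pair for each of the R, G and B channels."""
    return [
        fit_gain_offset(
            [color[ch] for color in observed_colors],
            [color[ch] for color in reference_colors],
        )
        for ch in range(3)
    ]


# Made-up square colors: the right image is corrected toward the left one.
right_squares = [(100, 50, 20), (200, 150, 120), (50, 250, 220)]
left_squares = [(90, 50, 17), (170, 150, 127), (50, 250, 237)]

model = fit_per_channel(right_squares, left_squares)
for name, (gain, offset) in zip("RGB", model):
    print(f"{name}: gain={gain:.2f} offset={offset:.2f}")
```

With more squares than parameters, the fit averages out the noise of individual squares, which is the reason for printing patterns with many colors.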

Final result

Once all these points are programmed, it gives a quite interesting result ! Here is a quick preview :


    On the top left, you can see the uncalibrated raw images from the webcams, on the top right, it’s the chessboards extracted from the stereo images.
On the bottom left, you can see the calibrated images output.

    On this sample application, I calibrated the right image using the left image as a reference. So, you can see that the right image before calibration is a little reddish and once it has been processed, it’s clearer and corresponds really well to the one on the left !
Mission cleared !

    Oh, I must mention though, that it runs quite slowly on my computer configuration as all this is running on a virtual machine. The VM runs on my 2007 Dell Inspiron 1520 laptop which begins to lack some processing capacity…
I’m currently considering buying a desktop computer with some « hardcore gamer » features… The problem is that it’s quite expensive… Hard choice !

Stereo matching – first trials

We now have the webcam images, left and right. Let’s apply the stereo matching algorithm.


Before doing anything, we first have to calibrate the webcams.
I tried a little without calibration to see what really happens : it works, but the results depend on how well you manually turned the webcams to have the images aligned.
It also greatly depends on how good the webcam lenses are, since they deform the image. It’s not something you can clearly see by looking at the images.
It’s only once you run the calibration that you can figure out how distorted your images are and that’s quite astonishing !

So what does calibration do ?
It runs some image modifications to map the images to a zone on which both images are perfectly planar and aligned.
It may require cropping your images a little because it gives some distorted borders as it applies a deformation algorithm on each image.
As an example, here is the output from the OpenCV stereo_match sample application.


Now, we have the assurance that both images we get from this algorithm are well calibrated and perfectly aligned. It makes a big difference on the resulting disparity map.

My implementation idea

As I explained, I need my webcams to be calibrated.
I spent some time searching for how to do it, and the easiest way I could find is using the stereo_calib sample from the OpenCV framework.
Unfortunately, EmguCV does not give samples we could play with to calibrate the webcams.
OpenCV has a lot of samples including two apps called stereo_calib and stereo_match which are perfect for what we’d like to do !

The stereo_calib program uses many pairs of stereo images of a chessboard to calculate a few matrices that are stored in two files named intrinsics.yml and extrinsics.yml. These files are then used by the final application to calculate the image deformation and then obtain calibrated images.
This part is shown on the stereo_match sample from OpenCV.

On the implementation part, I think it’s better to stay as close to the OpenCV environment as possible.
The idea is to use the stereo_calib sample application to calibrate the webcams and generate the YML files.
Then, in my C# program, load these files and compute the image deformations by porting some parts of the stereo_match sample application using EmguCV.
Of course, we could do the same for the stereo_calib sample, port it to C# instead of directly using it from OpenCV… I found this sample quite complete and it does exactly what I required so, I chose not to waste my time on it.

Installing OpenCV

So, as I said before, we have to install OpenCV to access its sample apps.
Since I found it not that easy to do without prior knowledge, I’m about to explain the main steps I went through. If it helps…

You first have to download the latest OpenCV package :

Once you got it, install it.
If you don’t have it, download and install CMake :

On Windows (probably the same for other OS), open a terminal window, navigate to where you installed OpenCV, navigate to the samples folder and then run CMake on it to generate the build files and build the samples. Be aware that it may take some time to compile…

Once it’s built, you should now have the OpenCV samples as executables in the Build folder.

Running the calibration

With OpenCV, there are stereo images called rightNN.jpg and leftNN.jpg, with NN going from 01 to 13. There’s also a file called stereo_calib.xml which contains the image list of these files.
So you can copy these files and paste them to the same directory as the stereo_calib executable. You can now run stereo_calib and see what it gives !

Final result

Ok, it’s working now ! I ran the sample app with my own stereo images and got the extrinsics and intrinsics files which I then loaded into my program and it gives this :


That’s pretty encouraging ! We can clearly see the shape of the lamp and distinguish the helicopter !

That’s quite good, but I would like to go a bit further in the subject. I noticed that the cameras don’t have the same color rendering, as you can see from the screenshot. Sometimes it’s even worse, depending on how one camera chose to adapt to the surroundings (ambient light, focus and so on). I saw that it can significantly affect the stereo matching results.

So what I’ll try to do is lock some parameters of the camera, starting with exposure, which keeps on changing the brightness of the image.
I’m also about to try some color calibration algorithms.
I’ll then be able to see how it affects the quality of the stereo matching results.

Getting the images from the webcams

Ok, so now that we have a stereo vision enabled robot, we’d like to see what it sees, right ?

Choosing the framework

Let’s start ! Getting images from webcams is quite easy with our favorite libraries ! The best one in my opinion to do so is Aforge (you could of course use another library that also relies on the Aforge dlls).

The reason for this choice is that with the framework you can control some camera parameters such as exposure, zoom, pan, tilt, focus and some others.
Controlling these parameters depends on the capabilities of your webcam.
In my case, with the C525 from Logitech, I can control at least focus, zoom, pan and exposure.
That’s quite interesting since I may want to leave them in the auto mode or maybe to control them to set them to a value I found is good for my application.

Using the framework to get the images

Using Aforge to get image streams from webcams is quite easy and it’s almost already done for you !
As you can see on the following link, there’s a sample application that shows you how to get and display two webcam streams :

Improving and customizing the framework functions

I found it easy but not complete. It gives you the connected cameras in an indexed list which is not necessarily ordered the same way each time you run your program (maybe depending on which webcam you plugged in first the very first time you connected them to a new computer…).
For example if you have a stereo webcam configuration, you don’t want your cameras to be swapped in any way since it may cause your application not to work anymore…
Moreover, you cannot save camera settings to automatically recover them the next time you run your app.
Finally, setting up the environment (objects to use, events, setting up the camera…) can sometimes be a bit tricky.

I then decided to code a class to avoid these issues.
I can share the code if somebody is interested.

Rotating the images

Because of the webcam positions and orientations on the robot, we have to rotate the images to have them the right way up, using the RotateFlip function from Aforge.
In my configuration, the left cam is rotated by 270° and the right one is rotated by 90°.
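
To show what these quarter-turn rotations do to the pixels, here is a tiny sketch on an image stored as a list of rows. It does not use RotateFlip : the functions are hypothetical stand-ins, and I assume the angles are measured clockwise, which the post does not state.

```python
def rotate90_cw(image):
    """Rotate an image (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def rotate270_cw(image):
    """Rotate an image (list of rows) by 270 degrees clockwise."""
    return [list(row) for row in zip(*image)][::-1]


# A tiny 2x3 "image" whose pixel values are just labels.
frame = [
    [1, 2, 3],
    [4, 5, 6],
]

right_view = rotate90_cw(frame)   # [[4, 1], [5, 2], [6, 3]]
left_view = rotate270_cw(frame)   # [[3, 6], [2, 5], [1, 4]]
print(right_view, left_view)
```

Note that the width and height are swapped after such a rotation, which matters for any code that relies on the image size afterwards.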

Final result

And here is the final result !

Left image / Right image

Perfect ! We now have the images from the webcams !
I had to properly turn the webcams to get a reasonably well « calibrated » image.
I mean, the images are visually aligned, as the top of the light is at nearly the same height in both images and the webcams seem to be planar.

We are now ready for applying the stereo imaging algorithm !
I’ll make a special post on this topic.

Interesting article about a new vision algorithm

That’s quite curious : just as I’m resuming my projects with vision stuff, I came across an article about some guys from MIT who just found a new vision algorithm ! Coincidence ? Mmmh ?

So here is the article :

Here is the French version (I didn’t forget ya !) :

To me, it seems quite interesting ! It extracts more data than what’s already available today. But well, the point is that it seems to require some laser pointing/measurement system which generally costs a lot of money and is pretty hard to embed on a small system.
Moreover, the article does not mention how long it takes for the algorithm to spit the results out… So we currently don’t know if it is going to work in « realtime » or not…

We’ll probably have to wait for the presentation in June to have more details… Apart from that, it looks great and it may bring some new stuff to vision algorithms !

Time will tell !