March 11, 2010 Education, Pedagogic Practice No comments
March 11, 2010 Education, Pedagogic Practice No comments
Привітання та дуже загальний вступ
Привіт. Мене звати Андрій Будай і сьогодні я вестиму цю пару, темою якої є k-means алгорим. Some parts will be in English, so please do not scare if I switch to English unexpectedly, like I just did.
As you all know we are talking about the clustering of data. Human brains are very good at finding similarity in things. For example, biologists have discovered that things belong to one of two types: things are either brown and run away or things are green and don’t run away. The first group they call animals, and the second is plants.
Процес групування речей по певних ознаках ми називаємо кластеризацією. Якщо ми говоримо про то, що біологи почали поділяти тварин на ряди, роди та види, то ми говоримо про ієрархічну класифікацію. Яку я передбачаю ви вже розглянули, зоокрема підхід ієрархічної кластиризації згори-вниз та знизу-вгору.
Розуміння Partitioning алгоритмів
А що якщо ми хочемо групувати дані не в порядку збільшення або зменшення об”єму груп, якщо ми хочемо погрупувати N векторів даних на K кластерів за один логічний хід? Такий різновид кластерізації називається Partitioning алгоритмами. Іншими словами ми розбиваємо множину.
Ще раз нагадаємо що таке розбиття множини:
Система множин S={X1 … Xn} називається розбиттям множини M, якщо ця система задовольняє наступним умовам:
Partitioning алгоритми розбивають множину на певну кількість груп за один крок, проте результати покращуються із процесом ітерування. До найвідоміших partitioning алгоритмів відносяться k-means та його модифікації із доповненнями, quality threshold алгоритм також для кластеризації можуть бути використані Locality-sensitive hashing та Graph-theoretic methods. Якщо буде час то ми поговорими і про них також.
K-means алгоритм
Отже, k-means. Насправді алгорим не є складним і я сподіваюся що всі із вас будуть розуміти суть алгоритму та будуть мати палке бажання його реалізувати. Це моя мета на сьогоднішню пару. []
Завданням алгоритму є здійснити розбиття N векторів даних на K груп. Причому число k є попередньо відомим – або є вхідним параметром алгоритму як і дані.
Опис
[Цю частину я збираюся продиктувати]
Маючи список даних спостережень (x1, x2, …, xn), де кожен x є вектором розмірності d, k-means кластеризація призначена для розбиття цього списку із n спостережень на k підмножин (k < n), в результаті чого ми повиння отримати таке розбиття S={S1, S2, …, Sk}, в якому внутрішньокластерна сума квадратів норм є мінімізованою:
де μi є центром Si.
Кроки алгоритму
Перш за все нам слід мати множину із k центрів m1(1),…,mk(1), які можуть бути визначені або випадковим чином, або евристичним шляхом. Алгоритм складається із двох основних кроків – кроку присвоєння та оновлення.
Демонстрація
[here I’m going to switch to English and demonstrate steps of the algo at board]
Оскільки алгоритм є емпіристичним, то немає гаранції, що він досягне глобального мінімуму. А також результат залежить від початкових вхідних кластерів. Оскільки алгоритм є досить швидким, то є така практика, як прогонка алгоритму деяку кількість раз із різними початковими центрами.
Other variations of the K-means algorithm
Because of some disadvantages of standard algorithm we’ve got few
different variations of it like Soft K-means, when data element is
fuzzy associated with mean, or like K-means++, which is additions to
the K-means to initialize cluster means better.
K-means++
With the intuition of spreading the k initial cluster centers away from
each other, the first cluster center is chosen uniformly at random from
the data points that are being clustered, after which each subsequent
cluster center is chosen from the remaining data points with
probability proportional to its distance squared to the point’s closest
cluster center.
Soft K-means алгоритм
В Soft K-means алгоритму кожен вектор має часткову приналежність до якогось кластеру, так як же і в розмитій логіці – існуть не тільки 1 та 0, а ще й 0.667. Таким чином, вектори, що знаходяться ближче до центру кластеру мають більшу приналежність до нього, а ті які знаходяться по краях мають дуже малу приналежність. Отже, для кожного вектора x ми маємо коефіцієнт який відповідає за приналежність до k-того кластеру uk(x). Звичайно, що сума коефіцієнтів для заданого вектора x є рівною 1:
Таким чином, центр кластеру вираховується як середнє всіх векторів, домножених на цей коефіцієнт приналежності до кластеру:
Коефіцієнт є обернено пропорційний до відстані до центру кластеру:
опісля коефіцієнти нормалізуються із використання параметра m > 1 щоб сума була рівною 1 за такою формулою:
Коли m
близьке до 1, кластери ближчі до центру мають більшу приналежність, а ті що дальше мають меншу приналежність. Для m = 2, це відповідає лінійній нормалізації коефіцієнтів, щоб отримати суму 1 і алгоритм подібний до звичайного k-means.
Кроки цього алгоритму є дуже подібними до звичайного:
Цей алгоритм також має деякі проблеми що й у звичайному алгоритму, зокрема залежність від початково вибраних кластерів, та не обов”язкове знаходження глобального мінімуму.
Можливий підхід до реалізації
Обов’язково розділяємо класи для зчитуванна та утримання даних, із класом, що буде здійснювати кроки алгоритму, якщо планується графічне зображення то воно також має бути відділене. Залишаємо можливість обирати норму.
[Не дуже хочу закидати реалізацію, щоб студенти завчасно не використали її]
He-he! I’ve got it in about 3.5 hours.
I would say that core algorithm code looks very good, at least take a look at method Process:
public void Process()
{
Initialize();
while (!ClustersRemainTheSame())
{
AssignStep();
UpdateStep();
}
}
I’ve also used realized algorithm in two applications – one is clusterization of animals by theirs properties and another is graphical demoing usage. Below is nice screenshoot:
Application
[If I’ll see that we are ok with English this part will be in it, otherwise we will have this in Ukrainian.]
“The k-means clustering algorithm is commonly used in computer vision as a form of image segmentation. The results of the segmentation are used to aid border detection and object recognition. In this context, the standard Euclidean distance is usually insufficient in forming the clusters. Instead, a weighted distance measure utilizing pixel coordinates, RGB pixel color and/or intensity, and image texture is commonly used“
One of the application examples is clustering colors in image to compress image. A good article on this is written up here.
Завершення пари
Побажання зробити реалізацію алгоритму
Час на питання
P.S. So far I have some good plan for Friday… not 100% complete, but I’m satisfied with what I have. And tomorrow I’m going to work on this once again. Also not really good that I just translated a lot of stuff from wiki pages.
March 9, 2010 Education, Pedagogic Practice, Ukrainian No comments
On Friday, I will have my first teaching experience (or at least teaching students of my University). This is part of the Master Year pedagogic practice. Subject is called “Data Mining” and theme of Friday’s lesson will be “K-means” algorithm.
Few things here before I start
I will have this post mixed of English and Ukrainian. Why? Because the way I’ll be presenting algorithm in both languages and because I need some report in Ukrainian, so do not want to double write a lot – I’m lazy ass as someone said.
План уроку (Agenda)
1. Привітання та дуже загальний вступ
Даю знати хто я і про що буду розповідати. Also I have a nice joke in English that could grab attention of the students. Hope they will be able to recognize and understand my English.
2. Розуміння Partitioning алгоритмів
Тут я нагадаю що таке Cluster analysis та згадую про Hierarchical і Partitional різновиди. І дещо про різниці між ними.
3. K-means алгоритм
– завдання алгоритму
– введення позначень
– кроки алгоритму
– демонстрація алгоритму як на дошці так і на комп”ютері
– переваги та недоліки
4. Other variations of the K-means algorithm
Because of some disadvantages of standard algorithm we’ve got few different variations of it like Soft K-means, when data element is fuzzy associated with mean, or like K-means++, which is additions to the K-means to initialize cluster means better.
5. Soft K-means алгоритм
– різниця в кроках алгоритму
6. Можливий підхід до реалізації (можливо напишу програму ще раз)
7. Application
If I’ll see that we are ok with English this part will be in it, otherwise we will have this in Ukrainian.
One of the application examples is clustering colors in image to compress image.
8. Побажання зробити реалізацію алгоритму
9. Час на питання
I hope tomorrow I’ll have article with whole lesson text, more structured and with examples, also hope to write k-means plus demo from scratch. Will see how fast am I.
March 3, 2010 IDE, WCF, Workaround No comments
If you have no “Edit WCF Configuration” in the context menu on right click on the App.config.
I’m talking about this one:
Go to Tools -> WCF Service Configuration Editor
Open it and then close.
This is funny and strange solution to our issue.
March 3, 2010 Revision of my activity, Success No comments
March 3, 2010 KohonenAlgorithm, MasterDiploma No comments
I already have some implementation of the parallelization in algorithm of SOM.
Let us quickly remind the main steps of the classical Kohonen learning algorithm.
As you see algorithm could be very expensive if we have big amounts of data.
Also, steps 3 and 4 takes the most of time, what if we execute 2-3-5 in separate threads? Yes, we could do this to some extend. Main issue is when we have overlapping of affected area by two best matched neurons wich we found in separate threads.
I’m bit tired to write a lot of explanations of this algorithm, so I prepared 3 images that says for themselves. Hope will have detailed explanations and different tuning things for this soon.
February 28, 2010 .NET, AutoMapper, Frameworks 4 comments
After I’ve posted this article one guy was really concerned about the performance of AutoMapper.
So, I have decided to measure execution time of the AutoMapper mapping and manual mapping code.
First of all I have code which returns me 100000 almost random Customers which goes to the customers list.
Measurement of AutoMapper mapping time:
Measurement of the manual mapping time:
I ran my tests many times and one of the possible outputs could be:
AutoMapper: 2117
Manual Mapping: 293
It looks like manual mapping is 7 times faster than automatical. But hey, it took 2 sec to map handrend thouthands of customers.
It is one of the situations where you should decide if the performance is so critical for you or no. I don’t think that there are a lot of cases when you really need to choice manual mapping exactly because of performance issue.
February 27, 2010 .NET, AutoMapper, Clean Code 2 comments
If you haven’t read my previous article on AutoMapper, please read it first.
I just read an article on the noisiness which could be introduced with AutoMapper.
My Mapping Code is Noisy
Accordingly to that article my code:
CustomerViewItem customerViewItem = Mapper.Map<Customer, CustomerViewItem>(customer);
is noisy code.
Why?
In few words that is because I’m coupled to the specific mapping engine and because I’m providing too much unneeded information on about how to map my objects.
Take a look on that line once again: we used Customer , but we already have customer, so we can know its type.
What to do to make it not so noisy?
In that article, that I pointed, there is proposition to change my code to:
CustomerViewItem customerViewItem = customer.Map<CustomerViewItem>();
So, I’ve got interested to apply it to my previous AutoMapper article’s code. As you already caught Map is just extension method for my Customer class.
Here it is:
Everything becomes simple if you are trying it.
February 26, 2010 .NET, C# No comments
I have never used constraints for my generic classes. So in this article I just want to show how I’ve used it:
So for now I have ICustomer interface:
and two implementations like UsualCustomer and VipCustomer.
Also I’m going to have some OrderingProcessor which is supposed to be operating with different types of Customers. But when I want to call GetOrders method I just have not it in my list, like below:
but once I’ve changed the declaration of the OrderingProcess just a little bit, with specifying type of T
World has smiled to me and I’m getting what I want, please see:
I expect that you are laughing while reading this post. Maybe because this thing is too simple to write article/post on it if you are familiar with C#/.NET. Yes, but I have never tried it myself. Did you?
To make you laughing not so much loud let us take a look on another architecture:
As you see we can create specific VipOrderingProcessor and we are constraining it to use just VipCustomers or derived classes. This gives us opportunity to specify constraints on different levels of implementation.
But do you know what does another constraint new() mean?
This constraint specifies that TCustomer objects will have public default constructor, so you can write
var vipCustomer = new TCustomer();
This could be very useful for implementing Factory Design Pattern. I’m sure that next time I will need to develop whole family of Factories I will use constraints, since code could be much flexible in this case.
February 25, 2010 .NET, AutoMapper, Frameworks 4 comments
The Problem
Have you ever faced need to write code, which looks like here:
Our scenario could be like:
We have our domain model which has Customer entity and we are going to show Customers in DataGrid, and for that we need much lighter object CustomerViewItem, list of which is bounded to the Grid.
As you see there are four lines of code which are just copying values from one object to another. Could also be that you will need to show up to 10-15 columns in you grid. What then?
Would you like to have something that will do mapping from Customer to the CustomerViewItem automatically?
Of course you do, especially if you have another situation like mapping of heavy data objects into DTO objects which are considered to be send though the wire.
AutoMapper
From the AutoMapper codeplex web page we see that “AutoMapper is an object-object mapper. Object-object mapping works by transforming an input object of one type into an output object of a different type. What makes AutoMapper interesting is that it provides some interesting conventions to take the dirty work out of figuring out how to map type A to type B. As long as type B follows AutoMapper’s established convention, almost zero configuration is needed to map two types.“, so in other words it provides the solution for our problem.
To get started go and download it here. It is standalone assembly, so you should not get difficulties with including reference to it in your project.
So in order to ask AutoMapper do dirty work instead of me, we need to add next line somewhere in the start of our code execution:
Mapper.CreateMap<Customer, CustomerViewItem>();
Once we have that we are done and we can use pretty nice code to get our mapped object:
Lets take a look on whole code base to see all about what I’m going to talk further:
And proving that all values has been mapped take a look on the picture:
More Complex Example 1 (Custom Map)
So far we know all to have extremely simple mapping. But what if we need something more complex, for example, CustomerViewItem should have FullName, which consists with First and Last names of Customer?
After I just added public string FullName { get; set; } to the CustomerViewItem and run my application in debug I got null value in that property. That is fine and is because AutoMapper doesn’t see any FullName property in Customer class. In order to “open eyes” all you need is just to change a bit our CreateMap process:
And results are immediately:
More Complex Example 2 (Flattening)
What if you have property Company of type Company:
and want to map it into CompanyName of the view class
How do you think what do you need to change in your mapping to make this work?
Answer: NOTHING. AutoMapper goes in deep of your classes and if names matches it will do mapping for you.
More Complex Example 3 (Custom type resolvers)
What if you have boolean property VIP in your Customer class:
and want to map it into string VIP and represent like “Y” or “N” instead
Well, we can solve this the same way we did for the FullName, but more appropriate way is to use custom resolvers.
So lets create customer resolver which will resolve VIP issue for us.
It looks like:
And only one line is needed for our CreateMap process:
.ForMember(cv => cv.VIP, m => m.ResolveUsing<VIPResolver>().FromMember(x => x.VIP));
More Complex Example 4 (Custom Formatters)
What if I want AutoMapper to use my custom formatting of the DateTime instead of just using ToString, when it does mapping from DateTime to String property?
Let say I want use ToLongDateString method to show birth date with fashion.
For that we add:
And making sure that AutoMapper knows where to use it:
.ForMember(cv => cv.DateOfBirth, m => m.AddFormatter<DateFormatter>());
So, now I’ve got:
Great, isn’t it? BirthDate is even in my native language.
I hope my article was interesting to read and it gave you ideas how you can utilize this new feature called “AutoMapper”.
February 23, 2010 Errors, IDE, VS2010 4 comments
I have found that not only my project has a lot of stupid bugs with more unexpected fixes, but also some well known products like Visual Studio.
Today I tried to install VS2010 RC on the XP system. Before that I had Beta 1 installed, but now it requires Service Pack 3.
So I installed Service Pack 3, deinstalled Beta 1 and was expecting to install VS2010 RC without any pain.
But what happen was interesting:
Installation process runs… and then just dies without explanations.
Playing with it I’ve got this
Error message:
To help protect your computer, Windows has closed this program. Name: Suite Integration Toolkit Executable
I searched over the internet how to fix this. And I found it:
Go to input languages settings and do the next for each input language you have:
I could not imagine that someone is able to guess such strange fix itself. :)
Of course there are explanations here and here.