6/16/2006

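Learning Japanese by watching football: gohan

In the Japan vs. Australia match, the word I heard most was "gohan".
The commentators, the players at the post-match press conference: everyone kept saying it.

I wondered why they were so fixated on "御饭" (that is, cooked rice).
Were they being treated unfairly in Germany and not getting enough to eat? By rights, that is the treatment the Chinese team usually gets. So it turns out all the Asian teams are discriminated against together...

Hmm... doubtful.

Only today, hearing "gohan" again, did it finally hit me like thunder and lightning. What they were saying! Was! Actually:
后半 (kōhan)!!! Short for "the second half" of the match.


I'm in awe of myself...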

6/14/2006

NOTE: String distance metrics

for
William W. Cohen, Pradeep Ravikumar, Stephen E. Fienberg (2003). A comparison of string metrics for matching names and records. The Workshop on Data Cleaning and Object Consolidation.

and
http://secondstring.sourceforge.net

1. Edit-distance metrics (differences in character position matter)

  • Levenshtein distance
  • Monge-Elkan distance
  • Smith-Waterman distance
  • Jaro similarity

2. Token-based distance (strings are treated as multisets of words)
  • Jaccard similarity
  • TFIDF (cosine similarity)
  • Jensen-Shannon distance
  • FS (Fellegi and Sunter) distance

3. Hybrid distance (a combination of token-based and string-based metrics)
  • a variant of Monge-Elkan distance
  • softTFIDF
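
To make the first two families concrete, here is a minimal sketch of one metric from each: unit-cost Levenshtein distance (edit-based) and Jaccard similarity over word tokens (token-based). The function names are my own and this is not the SecondString API; the Jaccard version also simplifies the multiset view above to plain sets of lowercase words.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit cost for insert, delete and substitute."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over sets of whitespace-separated, lowercased words."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0  # assumption: two empty strings count as identical
    return len(sa & sb) / len(sa | sb)


if __name__ == "__main__":
    print(levenshtein("Levenstein", "Levenshtein"))
    print(jaccard("William W. Cohen", "Cohen William"))
```

Note how the two behave differently on reordered names: word order does not affect the Jaccard score at all, while the edit distance between "William Cohen" and "Cohen William" is large.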

6/13/2006

NOTE: Blocking methods for record linkage

1. record linkage:
http://en.wikipedia.org/wiki/Record_linkage
Record linkage, also known as deduplication, refers to the task of finding entries that refer to the same entity in two or more files. Record linkage is an appropriate technique when you have to join data sets that do not have a unique database key in common. A data set that has been through record linkage is said to be linked.

2. Blocking methods
They are used in record linkage systems to reduce the number of candidate record comparison pairs to a feasible number while still maintaining linkage accuracy.

Blocking methods partition the data sets into blocks or clusters of records which share a blocking attribute or are otherwise similar with respect to a defined criterion.

Examples from [ref. 2]:

  • standard (traditional) blocking
  • sorted neighbourhood blocking
  • bigram indexing
  • canopy clustering with TFIDF
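
A rough sketch of the first two methods, to show how blocking cuts down the candidate pairs. Everything here is illustrative: the function names, the toy records and the choice of surname as the blocking key are my own assumptions, not taken from [ref. 2].

```python
from collections import defaultdict
from itertools import combinations


def standard_blocking(records, key):
    """Group records by blocking key; only pairs inside a block are candidates."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[key(rec)].append(idx)
    pairs = set()
    for members in blocks.values():
        pairs.update(combinations(members, 2))
    return pairs


def sorted_neighbourhood(records, key, window=3):
    """Sort by key, then pair each record with the next window-1 records."""
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        for j in order[pos + 1:pos + window]:
            pairs.add((min(i, j), max(i, j)))
    return pairs


def surname(name):
    """Hypothetical blocking key: the lowercased last word of the name."""
    return name.split()[-1].lower()


if __name__ == "__main__":
    # Made-up records, for illustration only.
    people = ["Ann Smith", "Anne Smith", "Bob Jones", "Rob Jones", "Ann Smyth"]
    total = len(people) * (len(people) - 1) // 2
    print("all pairs:", total)
    print("standard blocking:", sorted(standard_blocking(people, surname)))
    print("sorted neighbourhood:", sorted(sorted_neighbourhood(people, surname, window=2)))
```

With an exact key, "Smith" and "Smyth" land in different blocks and are never compared; the sliding window of the sorted neighbourhood method can still bring them together because their keys sort next to each other.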



----------------------------
ref.
----------------------------
1) Ivan P. Fellegi, Alan B. Sunter (1969). A theory for record linkage. Journal of the American Statistical Association. 64: 1183-1210.

2) Rohan Baxter, Peter Christen, Tim Churches (2003). A comparison of fast blocking methods for record linkage. ACM Workshop on Data Cleaning, Record Linkage, and Object Identification.

6/12/2006

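Google: there is no free lunch; the question is how high the price is.

A few quick lines while I have a moment.

A couple of days ago DN and I got talking about how Google has an all-round grip on personal privacy. He, for instance, won't use Google Calendar. I don't really mind.

If you use Google's "free" services, you always pay something in return. (That is how the world works; pies don't fall from the sky.) What I pay is this: the keywords I type are tracked and used to profile my interests; the full context of my mail sits in Gmail; my calendar is visible to Google's internal staff. That is the implicit contract I have signed with Google: I sell myself and buy some services.

It is the same if you use Hotmail, Yahoo or Baidu. But I only make this kind of deal with Google, because it says "don't be evil" and actually acts that way.

When the US government demanded search data from Yahoo, AOL, Microsoft and Google, only Google refused. That shows how rare Google is.
Under the other companies' contract terms (perhaps written into the terms of service nobody reads, perhaps not), "your" information is sold outright to the whole world, and they can do with it as they please. Under Google's contract, you are dealing with Google alone. Compare the two and it is obvious which is the better deal.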

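That is exactly why people were so anxious and uneasy when google.cn began serving filtered information in order to enter China. On a matter of principle, one step back is no different from a hundred. Put another way: I hand over the same personal data, but the information I get in return is no longer the real thing, which already feels like a losing trade. Besides, a company that can give way on the information it provides may well give way on keeping information confidential. Nobody is a fool; look at the figure itwire reports: "... Brin told Reuters that only 1% of Chinese users accessed Google.cn with the rest going to Google.com."

A more recent report is headlined "Google founder considers abandoning 谷歌 (Google's China brand); strategy conflicts with principles". Of course I don't know what is behind that statement. Perhaps Google, even after making concessions, is still intermittently disrupted and blocked by the GFW, so it is talking tough to gain leverage in negotiation? Or, as Lian Yue (连岳) put it: "Perhaps Google has realised that there is no such option as 'half human, half slave'." I don't ask Google to always be right, but I am glad to see it keep sticking to its principles.

The day it abandons its principles is the day I start abandoning Google.

I hope that day never comes.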

6/08/2006

Magic Cube!


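It turns out the French guy in our lab is a Rubik's Cube expert. Apparently he can solve all six faces in forty-odd seconds!!

I got him to give a demo. Amazing! His fingers flew. I would love to record him solving it and put the video on the blog...

I had a Rubik's Cube as a kid. I always solved one face first and then tried for a second, and I don't think I ever got two faces done. (I seem to remember an uncle who managed two faces, though I may be misremembering.) So that cube's fate, in the end, was to be pried apart piece by piece and reassembled into a perfect six faces. From then on that was how I enjoyed it; I had found a new way to play! 0(^_^)V

Should I tell the Frenchman that I can take a cube apart and put it back together in forty seconds?

Only today did I learn the right approach: solve one face first, then the band of pieces right next to that bottom face, then the second band above it, and finally the last band (together with the top face). It is a bit like weaving a cage. I had always thought a cube was solved face by face!

I tried it today and could solve the bottom face and the first band quite smoothly; I'll tackle the next band in a few days. There are plenty of tutorials online, but I have decided to slog at it on my own for a while before looking.

It has also stirred up my urge to shop: off to Tokyu Hands to buy a cube. :)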

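A helpless JOKE & Lenovo teams up with Microsoft



Today I noticed that a picture of a little voodoo doll stuck with pins has appeared in the sidebar of many donews blogs. The reason is here



Apparently the GFW has once again left many users in China unable to reach google.com, Gmail and Gtalk. I asked jjgod and lulu, and both can use them normally, so I don't know how wide the impact really is. But with everyone in such an uproar, it is probably not made up.

Lulu and I had a small discussion about why everyone gets so worked up every time there is a ban. The conclusion I find most convincing is that, beyond offering very good services, Google has been given meanings such as "equality" and "integrity" by its (Chinese) users. So every time a Google service is banned, users feel first-hand that their own rights have been stripped away once more. (That is probably also why 谷歌's compromise leaves everyone with a helpless resentment.)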

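I saw this post forwarded from SMTH (水木社区): "Eight Honors and Eight Shames" saves Google
---------------------
"Eight Honors and Eight Shames" saves Google
From: jiangxz (robert), Board: Google
Subject: How I went from being blocked to unblocked on Google
Posted at: 水木社区 (Sun Jun 4 14:35:54 2006), on-site

I haven't been able to reach Google these past two days and have been using Baidu for several days.
Just now I saw that someone on the board couldn't use Gmail, and other users suggested adding something like "八荣八耻" (Eight Honors and Eight Shames) to the URL.
I use Maxthon. I typed "八荣八耻" into the search box and pressed Shift+Enter.
Unblocked.... So there really is a whitelist. Haha.

---------------------

Actually, every time I freely visit any website I like from here, I can't help recalling the bitter past: the days at university spent hunting high and low for proxies just to reach Google and the like. :)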

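Also, I saw Lenovo's new move:
Lenovo fully backs Microsoft; no PCs will ship with Linux preinstalled

A couple of days ago I saw Yang Yuanqing wailing for the government to step in and help solve the problem of entering the US market.
Now here is a new attempt: at a critical moment, teaming up with Microsoft. This actually makes me see them in a new light. I have no further comment for now; I will keep following it with mild approval.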

6/06/2006

About rdfs:member

0. rdfs:member in RDF vocabulary
rdfs:member:"is an instance of rdf:Property that is a super-property of all the container membership properties i.e. each container membership property has an rdfs:subPropertyOf relationship to the property rdfs:member."

ReferTo: RDF Vocabulary Description Language 1.0: RDF Schema
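
A small sketch of what that definition implies in practice: because every container membership property (rdf:_1, rdf:_2, ...) is a sub-property of rdfs:member, any triple using one of them entails a matching rdfs:member triple. This is plain Python over tuples rather than a real RDF library, and the function names and the `ex:` identifiers are made up for illustration.

```python
import re

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"

# rdf:_1, rdf:_2, ... (a positive integer with no leading zero)
_MEMBERSHIP = re.compile(re.escape(RDF) + r"_[1-9][0-9]*")


def is_container_membership_property(uri: str) -> bool:
    """True for rdf:_n URIs, the sub-properties of rdfs:member."""
    return _MEMBERSHIP.fullmatch(uri) is not None


def infer_members(triples):
    """Return the input triples plus the rdfs:member triples they entail."""
    inferred = set(triples)
    for s, p, o in triples:
        if is_container_membership_property(p):
            inferred.add((s, RDFS + "member", o))
    return inferred


if __name__ == "__main__":
    # Hypothetical data: a two-element rdf:Seq.
    data = {
        ("ex:myseq", RDF + "type", RDF + "Seq"),
        ("ex:myseq", RDF + "_1", "ex:a"),
        ("ex:myseq", RDF + "_2", "ex:b"),
    }
    for triple in sorted(infer_members(data)):
        print(triple)
```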

1. rdfs:member with protege
Recently I have been trying to build a domain ontology with Protege version 3.2 alpha. To preserve full expressive power, the project is set to
OWL Full. Then I

  • 1) created a subclass of rdf:Seq named, for example, Myseq.
  • 2) tried (and failed) to add constraints on the rdfs:member property.
    By definition, the range of rdfs:member is rdfs:Resource. But I want to constrain the resources that can take part in that property for Myseq, e.g. impose an owl:allValuesFrom constraint on rdfs:member. In the end I found this is impossible in Protege (as far as I know now).

Protege lists rdfs:member as an individual of rdf:Property in the "individuals tab", not as a property in the "properties tab". Now I am curious about the reason. Should I send an email to their mailing list? @_-?

2. A quick reference to OWL (Lite, DL, Full) construct.

Since it seems one can't use rdfs:member (or its sub-properties rdf:_#), I have to wonder whether it is a legal member of OWL Full or not. The next question is which RDF vocabulary terms are reserved in OWL (Full, DL and Lite).

ref:

  1. http://www.w3.org/TR/owl-ref/#Sublanguage-def
  2. http://www.w3.org/TR/owl-semantics/
    chapter4: http://www.w3.org/TR/owl-semantics/mapping.html

A brief summary:

  1. OWL Full contains all the OWL language constructs and provides free, unconstrained use of RDF constructs.
  2. OWL DL uses all the OWL language constructs, but with a number of constraints. Most RDF(S) vocabulary cannot be used within OWL DL. See the OWL Semantics and Abstract Syntax document [OWL S&AS] for details.
  3. OWL Lite. see here

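MSN, go ahead and throw your fit

Just when the "little flower" (小花) was finally showing signs of improving, the statistics ran into trouble.
Today's statistics won't display at all.
And synchronized publishing has problems too...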

6/01/2006

The world you don't know

try this:


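With many things, knowing them in theory and experiencing them first-hand really are two different matters.

For example, we know that the scripts of the Islamic world are written from right to left. But I had never thought about what their web pages would look like.
The URL above is one I dug out of my MSN Space visit history. Shocking, just shocking~
Everything is right-aligned. The awkward part is that each line still starts from the left...

I'm curious whether all the websites in their countries are right-aligned... If so, wouldn't visiting other countries' websites feel very unnatural to them?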

An ontology for e-learner - planning stage

predefined ontologies which might be useful

1. FOAF
2. event ontology from UMBC ebiquity group
3. Dublin Core

related work

1. Stojanovic, L., Staab, S and Studer, R. (2001) ELearning Based on the Semantic Web. Proceedings WebNet2001 - World Conference on the WWW and Internet, Orlando, Florida, USA, 2001.
2. Journal of Educational Technology & Society
Special Issue on "Ontologies and the Semantic Web for E-learning"