Secretary: tel. +48 22 380-05-04, +48 22 380-05-05
Main phone number of the Institute: tel. +48 22 380-05-00
fax. +48 22 380-05-10
Activities and interests of the group members center around the following topics:
The leading researchers working on the above subjects: W. Jamroga, W. Penczek.
PhD student at TIBPAN
The research group is active in the field of intelligent systems in the following research areas:
The research group is working on a massively parallel search engine designed to work with Polish Internet resources in a novel way. Our specialty is systematizing online resources and making that systematics perceivable to the user. Systematization is understood as the automatic distribution of online resources into thematic groups, highlighting thematic channels within websites, and labeling and categorizing documents and their groups. From the user's point of view, this translates into more precise document identification; systematization also enables contextual search of both individual documents and their groups, such as channels or services, as well as diversification of the search engine's response.
By diversification we mean variation in the response, so that the user sees not only the best documents but also the variety and thematic ambiguity of the results. A classic example is a query for "game", which may refer either to playing or to a term familiar mainly to hunters.
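The idea of diversification can be illustrated with a small sketch. It is not the group's algorithm: it is a generic greedy re-ranking that discounts topics already shown, and the function name, the `penalty` parameter, the topic labels and the scores are all hypothetical values chosen for illustration.

```python
def diversify(results, k, penalty=0.5):
    """Greedily pick k results, discounting topics that were already shown.

    results: list of (doc_id, relevance_score, topic) tuples (illustrative format).
    """
    remaining = list(results)
    shown = {}    # topic -> how many results with this topic were already picked
    picked = []
    while remaining and len(picked) < k:
        # The effective score shrinks geometrically with each repeat of a topic.
        best = max(remaining, key=lambda r: r[1] * penalty ** shown.get(r[2], 0))
        picked.append(best)
        remaining.remove(best)
        shown[best[2]] = shown.get(best[2], 0) + 1
    return picked


# Made-up results for the ambiguous query "game".
results = [
    ("board-games-shop", 0.95, "playing"),
    ("video-game-news", 0.90, "playing"),
    ("card-game-rules", 0.85, "playing"),
    ("venison-recipes", 0.60, "hunting"),
    ("hunting-seasons", 0.55, "hunting"),
]
top3 = [doc for doc, _, _ in diversify(results, k=3)]
plain_top3 = [doc for doc, _, _ in sorted(results, key=lambda r: -r[1])[:3]]
```

With these made-up scores, a plain ranking returns three "playing" documents, while the diversified list also surfaces a "hunting" document in second place.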
Taking context into account is important when looking for a document that is comprehensible only in the context of other documents in a particular thematic channel. For example, when asking a search engine about tires, which is fairly common in autumn and spring, we would expect links to the websites of tire manufacturers or tire shops rather than, e.g., to sites about hard work that makes us tired. Making use of the context will allow the search engine to return documents that contain the word "tire" even when the word "car" does not occur in them.
Systematization understood in this way will be a useful tool for many groups of users. Scientists and entrepreneurs will be able to look for potential partners or competitors in the market. Systematization will also help them identify interesting research areas or market gaps that can be exploited.
A promising direction of research is the use of search engine technology and tools for acquiring knowledge from data, text and hypertext to analyze social networks.
An integral part of the outlined research is optimization. The research group focuses on so-called meta-heuristic methods and algorithms based on evolutionary, immunological, or swarm intelligence methods. The main advantage of the methods we have developed ourselves is the diversification of the obtained solutions. Its benefits can hardly be overestimated, especially when optimizing a system with a large, dynamic environment. Besides search engine technology, the above methods can find broad application in areas such as optimization of chemical reactions, control of production processes, or simulation of social processes.
Both in our studies of social networks and in our work on dynamic environments, we use simulation tools built by our team (based on our theoretical studies) to generate synthetic data that resemble real-world data.
We also conduct research in the classic areas of knowledge acquisition from data, with in-depth analysis of the acquisition of so-called active classifiers, i.e. classifiers that distinguish between the characteristics of objects that can be controlled and those that cannot.
In these and other areas of intelligent systems research we cooperate with the Polish search engine industry, foreign companies involved in the modeling market, as well as with Polish universities, such as the University of Cardinal Stefan Wyszynski, Polish-Japanese Institute of Information Technology, University of Nature and the Humanities, University of Gdansk, Wroclaw University of Technology, and foreign ones, such as Max Planck Institute of Computer Science (EU), University of North Carolina at Charlotte (USA), San Diego State University (USA) and University of Adelaide (Australia).
The Linguistic Engineering Group (Pol. Zespół Inżynierii Lingwistycznej; ZIL) deals with multiple aspects of Natural Language Processing.
ZIL's traditional area of interest is deep syntactic parsing of Polish, with the use of Definite Clause Grammars (DCG) and generative linguistic formalisms, such as Head-driven Phrase Structure Grammar (HPSG) and Lexical Functional Grammar (LFG). For each of these approaches, a grammar of Polish has been developed and implemented, with current work concentrating on DCG and LFG.
Another important focus of the Group's research is broadly understood information extraction: many publications have been devoted to the automatic extraction of structured data from domain texts, to named entity recognition and to shallow parsing in general. Related work includes automatic acquisition of linguistic knowledge - including valence frames - from corpus data.
More recently, ZIL has also been dealing with the semantic processing of texts, focusing on word sense disambiguation, coreference resolution and sentiment analysis. Certain elements of semantic processing are present in the LFG parser mentioned above. More application-oriented work within this thread concerns automatic summarisation and text categorisation.
The Group is also active in the area of corpus linguistics. ZIL coordinated the development of the 1.5-billion-word National Corpus of Polish (Pol. Narodowy Korpus Języka Polskiego; NKJP), based to some extent on the earlier IPI PAN Corpus. In the process, the Group created various tools for manual and automatic corpus annotation at multiple linguistic levels, an XML schema for corpus annotation, and a manually annotated 1-million-word subcorpus. This subcorpus is the empirical basis for the Składnica treebank which is currently being developed; Składnica has already been used to train a dependency parser for Polish.
Various tools created by the Group are publicly available as open source software. They include: morphosyntactic taggers, a shallow parser Spejd, a deep parser Świgra, a named entity recogniser Nerf, a word sense disambiguation platform WSDDE, corpus tools Poliqarp and Anotatornia, etc. The Group is also responsible for the development of an open morphological dictionary PoliMorf - to be used in deep parsing and other applications - based on earlier such dictionaries. The above tools and resources are used in applications co-developed by ZIL, e.g., in a multilingual content management system.
ZIL has been and is active in multiple national and international projects.
For more information, please visit: http://zil.ipipan.waw.pl/.
The members of the group conduct research on generalizing well-established machine learning methods to uplift modelling, which models the causal influence of a given action (e.g. a marketing campaign or medical therapy) at the level of an individual by taking into account a control group not subjected to the action. The theory of linear models for the uplift case is also being developed.
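As a minimal illustration of the uplift idea (not the group's methods), the sketch below estimates uplift per customer segment as the difference between the response rate in the treated group and the response rate in the control group. The function name, the record format and the data are hypothetical.

```python
from collections import defaultdict


def uplift_by_segment(records):
    """Estimate uplift per segment as treated response rate minus control response rate.

    records: iterable of (segment, treated, responded) tuples (illustrative format).
    """
    # Per segment: [n_treated, responses_treated, n_control, responses_control]
    counts = defaultdict(lambda: [0, 0, 0, 0])
    for segment, treated, responded in records:
        c = counts[segment]
        if treated:
            c[0] += 1
            c[1] += int(responded)
        else:
            c[2] += 1
            c[3] += int(responded)
    uplift = {}
    for segment, (nt, rt, nc, rc) in counts.items():
        if nt and nc:  # an estimate needs both a treated and a control group
            uplift[segment] = rt / nt - rc / nc
    return uplift


# Made-up campaign data: (segment, received the campaign?, bought?)
records = (
    [("new", True, True)] * 3 + [("new", True, False)] * 1
    + [("new", False, True)] * 1 + [("new", False, False)] * 3
    + [("loyal", True, True)] * 3 + [("loyal", True, False)] * 1
    + [("loyal", False, True)] * 3 + [("loyal", False, False)] * 1
)
uplift = uplift_by_segment(records)
```

In this toy data both segments respond equally often after the campaign, but "loyal" customers buy just as often without it, so their estimated uplift is 0.0 while the "new" segment's is 0.5. A plain response model would not separate the two; the control group is what makes the difference visible.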
The domains researched by the group include information-theoretic and probabilistic modelling of natural language. Of special interest here are discrete stochastic processes with strong dependence, which is measured by the rate of growth of the block entropy and of the length of the maximal repetition. Such processes exhibit certain statistical properties close to those found in natural language productions, e.g. related to fulfilling Hilberg's hypothesis. Their construction is studied, as is statistical inference for them, with applications in computational linguistics.
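The two quantities named above can be computed naively on a finite string, as sketched below. This assumes the simplest readings of the terms: a plug-in (empirical frequency) estimate of block entropy, and the maximal repetition taken as the length of the longest substring that occurs at least twice, with overlaps allowed. The function names are illustrative and the implementations are not meant to be efficient.

```python
import math
from collections import Counter


def block_entropy(text, n):
    """Plug-in estimate (in bits) of the entropy of length-n blocks of text."""
    blocks = [text[i:i + n] for i in range(len(text) - n + 1)]
    total = len(blocks)
    return -sum(c / total * math.log2(c / total) for c in Counter(blocks).values())


def maximal_repetition(text):
    """Length of the longest substring occurring at least twice (overlaps allowed)."""
    longest = 0
    for n in range(1, len(text)):
        seen = set()
        repeated = False
        for i in range(len(text) - n + 1):
            block = text[i:i + n]
            if block in seen:
                repeated = True
                break
            seen.add(block)
        if not repeated:
            # If no block of length n repeats, no longer block can repeat either.
            break
        longest = n
    return longest


h1 = block_entropy("abab", 1)             # two equally frequent symbols
rep = maximal_repetition("abcabcab")      # "abcab" occurs at offsets 0 and 3
```

Studying how these values grow with the block length or the text length is the kind of question the paragraph refers to; on real corpora a suffix-array or suffix-automaton implementation would be needed instead of this quadratic scan.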
Another research direction concerns classification methods for multivariate response variables. An intensively studied special case is so-called multi-label classification, in which the response is a multivariate variable with binary coordinates. Of particular interest is the construction of effective methods for high-dimensional data, where high dimensionality refers both to a large number of potential predictors and to the dimensionality of the response. The aim of the research is to develop algorithms for variable selection and prediction in this set-up and to analyze their performance theoretically.
Variable selection is also studied for high-dimensional generalized linear and additive models. Here, we study two- and multi-step procedures in which selection is based on information criteria after preliminary screening and/or ranking of the variables according to the values of their importance measures. The measures are constructed from a large number of small models with randomly chosen predictors. The main results concern selection consistency when the model assumed for the data at hand is correctly specified. The analogous problem is also studied for the misspecified case, with the concept of selection consistency suitably modified.
Research on modelling stochastic dependence using a copula-based approach is also being pursued by the group.
For more information, please visit: http://zams.ipipan.waw.pl/.
(Stanisław Ulam, 1975)
Computational Biology Group (CBG) is a new unit in the Department of Artificial Intelligence. CBG has two main areas of research:
The main achievements in the first area are a method and an implementation of a system for selecting and ranking features in classification tasks using decision trees and a Monte Carlo method (MCFS), and a system for constructing classifiers (ROSETTA) based on Pawlak’s rough sets.
In the second area, CBG has made several significant contributions to modeling pathogenicity of Avian Influenza Virus and to the recently very active research on mutations in regulatory regions and their correlations to various types of cancer.
Further work on methods focuses on finding “inter-dependencies” between significant features (implemented in the MCFS-ID system) and developing methodologies for rule networks generated from rough set models.
The current major research task of CBG is the creation of an atlas of regulatory regions in the human brain: transcription regions, transcription binding sites, enhancers, chromatin structure and histone modifications. This task is funded by the National Research Centre with a Symfonia 3 grant received jointly with the Nencki Institute of Experimental Biology and The Institute of Informatics, University of Warsaw. The main goal of this research is to improve our understanding of the biological processes of glioblastoma and of psychiatric diseases such as schizophrenia and bipolar disorder.
The approach of CBG combines the achievements of a leading computer science institute with recent advances in biotechnologies applied to the Life Sciences. CBG offers an interdisciplinary agora for biologists, statisticians, linguists, oncologists and computer scientists. CBG's research not only realizes Stanisław Ulam's prediction but also confirms that the Life Sciences contribute to major developments in computer science and mathematics.
For more information, please visit: http://zbo.ipipan.waw.pl/.