Analysing the Data I

Over the past month I have been going through the data using different strategies. Some of the strategies consist of detailed analyses of the notations, while others try to derive broad-view conclusions.

Example of a detailed analysis:

By counting the number of diagrams/participants that presented a few related characteristics, it is possible to derive results like the following:

  • In at least one of the queries with a WHERE COLUMN_NAME OPERATOR VALUE filter:
    • 15 participants represented the SQL equal operator as =.
    • 16 participants omitted the equal operator (i.e. only the column name and the value were used).
    • 6 participants represented the SQL equal operator with symbols other than equal (i.e. ≡, ==, :, ||).
    • 5 participants represented the SQL equal operator with a word (i.e. is or IS).
    • 20 participants represented the SQL greater than operator as >.
    • 2 participants represented the SQL greater than operator as < (in both cases it was an error).
    • 5 participants represented the SQL greater than operator with words (i.e. greater than, GREATER THAN, is greater than, over, or older than).
  • In the first query of the experiment (where both the equal and the greater than operators are included in the WHERE clause):
    • 8 participants chose equal as the default operator (i.e. equal omitted and greater than present).
    • 21 participants explicitly represented both the equal and the greater than SQL operators.
  • From the above counts, we can conclude that, even though 53% of the participants omitted the equal operator, we cannot consider it a default comparison operator. For queries that only had equal as the comparison operator, both its omission and its representation as = are common, but when it was used alongside another comparison operator, most of the participants represented the equal operator explicitly.
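
The percentages behind these counts can be reproduced with a short sketch. The counts are copied from the list above; the total of 30 participants is an assumption (it is the figure that makes 16 participants come out at 53%), and the dictionary keys are labels invented for this illustration. Because the counts refer to "at least one of the queries", a participant can fall into more than one category, so the shares are not expected to add up to 100%.

```python
# Assumed total number of participants (not stated explicitly in this post).
TOTAL_PARTICIPANTS = 30

# Counts copied from the list above; the keys are illustrative labels.
equal_operator_counts = {
    "represented as =": 15,
    "omitted": 16,
    "other symbol": 6,
    "word": 5,
}

first_query_counts = {
    "equal omitted, greater than present": 8,
    "both operators explicit": 21,
}


def share(count, total=TOTAL_PARTICIPANTS):
    """Return the percentage of participants, rounded to the nearest integer."""
    return round(100 * count / total)


equal_shares = {label: share(n) for label, n in equal_operator_counts.items()}
first_query_shares = {label: share(n) for label, n in first_query_counts.items()}

for label, pct in {**equal_shares, **first_query_shares}.items():
    print(f"{label}: {pct}%")
```

Under the same assumption, the first-query counts show why equal cannot be treated as a default: 21 participants wrote both operators explicitly, against 8 who omitted equal while keeping greater than.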

Example of a broad-view analysis:

To analyse the general structure of the diagrams, it is useful to have all of them visible at the same time. For example, the diagrams representing the first query of the experiment are arranged by similarity in the following picture:

SQL equal operator represented as =
SQL equal operator represented as another symbol (≡, ==, :, ||)
SQL equal operator represented with a word (is, IS)
Omitted SQL equal operator
SQL greater than operator represented as >
SQL greater than operator represented as < (error)
SQL greater than operator represented with words (greater than, GREATER THAN, is greater than, over)
SQL greater than operator represented as older than (data specific)
SQL equal operator as default comparison operator (> present and = omitted in the same diagram)

Since the picture does not have enough resolution, let’s use an equivalent diagram.

In that diagram a red circle is used to mark which participants represented the WHERE COLUMN_NAME OPERATOR VALUE filters with a table like the following:

The diagram also shows the distribution of participants that in the first query represented the structure of the database with a table (blue), those that represented it with a list of field names inside a box (pink), and those that did not represent it (green).

Among the diagrams collected in the experiment there is not a single pair that uses the same notation. This confirms how diverse the ad hoc notations used by programmers are. However, it is possible to identify groups of around 10 participants who used similar notations.



First preliminary results

Today I presented the first preliminary results of my research at the last meeting of the Scientific Writing Course. You can download the slides from here, or see them online at SlideShare.

Random draw!

Thanks a lot to all the people who participated in the study, or helped to publicize it. The collection of data for the first stage is finished 🙂

Enjoy the video!

This was an amazing experience for me, and I hope you enjoyed it. Right now I am preparing the first slides with preliminary results, which I will be presenting at the Scientific Writing Course on July 6th. The slides will be posted here on the blog in two weeks. My task for the rest of June and July is to analyze all the data you have provided, so we can determine how programmers represent database queries using diagrams.

Thanks to everyone 😀

Précis: Comparative ease of use of a diagrammatic vs. an iconic query language

A. N. Badre, T. Catarci, A. Massari, and G. Santucci: Comparative ease of use of a diagrammatic vs. an iconic query language. In Interfaces to Databases. Electronic Series Workshop in Computing, Springer, pages 1-14 (1996).

Few experiments in the database field have validated the influence of visual query systems with respect to accuracy and time scores. The authors carried out a study to compare QBD* and QBI, specifically for the query writing task.

Two groups of sixteen participants were formed based on the results of a background questionnaire. Each group attended a short training session and then used one of the visual query systems to represent six queries. The ANOVA test reported significant differences in the time scores for the following sets of data: all the participants; the participants familiar with databases; and the queries with cycles or at least four entities. The participants who used QBD* spent less time, except when the query contained cycles. Comments from the participants pointed out that the use of AND as a default operator was unclear in QBD*.

The authors concluded that, when more than three entities were involved, QBI performance was affected because the query was not constructed in steps. The effect of cycles in the query on participants working with QBD* was attributed to the representation of correspondences between attributes and entities: when an entity occurred multiple times, a number was added to the corresponding attribute names. The authors conjecture that the results favor the use of interfaces that offer multiple notations and interaction mechanisms. However, it is not clear how users would react to a hybrid system, because each participant used only one of the visual query systems.

Précis: Visual Query Systems — A taxonomy

As I mentioned last week, Tiziana Catarci, Maria F. Costabile, Stefano Levialdi and Carlo Batini did a lot of work in the area of visual query systems. They were mainly interested in the task of writing queries using visual systems, but their research has a lot in common with my thesis project. Though I cannot write a précis for each of the papers, I will mention some details before including today’s précis.

Let’s start with the paper that provided the big picture of their research: What happened when database researchers met usability?, written by Tiziana Catarci in 2000. It is interesting to note that at the beginning of their Ph.D. they intended to use entity-relationship diagrams as a database query interface, which was the origin of Query by Diagram (QBD or QBD*). This brought me back to the initial ideas about my thesis project. At that point the authors focused on the kinds of queries that the system would be able to express, but later on they conducted empirical studies to compare users’ performance while writing queries in SQL, QBD* and QBI. Next week’s précis, which will be the last mandatory one for the scientific writing course, will cover the comparison between QBD and QBI (a diagrammatic vs. an iconic system).

Today I will be looking at a taxonomy published by these authors in 1992. I am planning to finish reading a longer paper published by them in 1997 that also covers the classification of visual query systems. It is important to note that not only the papers are relevant to my research, but also the references provided in them. Without any more introduction, here is today’s précis:

Batini, C., Catarci, T., Costabile, M. F., and Levialdi, S.: Visual Query Systems: A Taxonomy. In Proceedings of the IFIP TC2/WG 2.6 Second Working Conference on Visual Database Systems II. E. Knuth and L. M. Wegner, Eds. IFIP Transactions, vol. A-7. North-Holland Publishing Co., Amsterdam, The Netherlands, pages 153-168 (1992).

Query systems make it possible to represent data models and requests. There is a clear division between query systems that use programming-like languages and those that use visual representations. The authors propose a taxonomy of visual query systems that will serve to analyze the influence of their features on HCI. The taxonomy is based on the operators available in the query language, the notations, and the classes of users.

According to their notation, visual query systems are classified as tabular, diagrammatic, iconic, or hybrid. Tabular representations are used by QBE, ESCHER, R2 and EMBS to display queries in 2D. The diagrammatic approach usually expresses the database schema through geometrical figures and connections, and the queries are represented by the selection of relevant elements and connections. The authors present QBD* as an example of a system that uses this kind of diagram. The use of icons in query systems is illustrated with the description of ICONICBROWSER. A relevant aspect of these iconic systems is that the data model is not explicit. The authors also describe SICON, which is a hybrid system combining both diagrams and icons. Unfortunately, the figures with examples from these languages are not visible in the digital version of the paper.

Categories related to the availability of query operators and the classes of users are also discussed. However, the relationship between these categories and the examples of existing query systems is not analyzed in detail.

Précis: Why a Diagram is (Sometimes) …

Next week I will be posting the précis of a paper from the field of Visual Query Systems. I found that the work done by Tiziana Catarci, Maria F. Costabile, Stefano Levialdi and Carlo Batini is closely related to my research project. I will be able to compare the notations used by the visual query systems they classified with the notations used by the participants of my study.

Here is this week’s précis. This paper has been cited by many authors in the area of visual representations.

Jill H. Larkin, Herbert A. Simon: Why a Diagram is (Sometimes) Worth Ten Thousand Words. Cognitive Science, Vol. 11, No. 1, pages 65-100 (1987).

Diagrams are used to assist in the solution of problems in physics and engineering. The authors compare sentential and diagrammatic representations. Their main objective is to analyze the computational efficiency of informationally equivalent representations in terms of search, recognition and inference cost.

The authors define a sentential representation as a sequence of expressions; in contrast, the elements of a diagrammatic representation are located in a plane in which the concept of adjacency is richer. To illustrate their analysis, they use two examples, one from physics and the other from geometry. They modify the problem definitions starting with natural language versions, followed by the sentential representations and then the diagrammatic representations.

The main conclusions clearly point to the benefits of diagrammatic representations in the recognition and search processes, emphasizing their ability to reduce the use of identifying labels. No differences were found in the inference process. Unfortunately, detailed analyses were conducted mainly for problems with considerable spatial information; other kinds of problems were described only briefly.

A major contribution of this paper is the analysis framework. The use of data structures and inferential rules made possible a detailed analysis of efficiency, similar to the analyses applied to determine the computational complexity of algorithms. However, it was necessary to use a simplified model of the focus transitions between parts of the representations.

Starting to Analyze the Data

At the moment I have data from 20 participants (thanks a lot to those who gave me a bit of their time). I cannot describe the diagrams yet, to avoid influencing the next 10 participants. However, I can start talking about the first step of the data analysis. During the next weeks, I will be extracting short descriptions (labels) that characterize the diagrams. To separate incidental aspects of the notations from significant regularities, I will go through the data several times until I have created a table like the following:

Characteristic | Number of Participants | Number of Queries | Number of Diagrams | Number of First Attempts | Number of Second Attempts

I am eager to see the final version of this table 🙂
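
The tallying step described above could be sketched roughly as follows. This is only an illustration under several assumptions: the records, participant identifiers and labels are invented (they are not data from the study), each diagram is assumed to be identified by participant, query and attempt number, and "Number of Queries" is read as the number of distinct queries in which a label appears.

```python
from collections import defaultdict

# Hypothetical observations, one record per (diagram, label) pair.
# These values are made up for illustration and are not study data.
observations = [
    {"participant": "P01", "query": 1, "attempt": 1, "label": "equal operator omitted"},
    {"participant": "P01", "query": 2, "attempt": 1, "label": "equal operator omitted"},
    {"participant": "P02", "query": 1, "attempt": 2, "label": "equal operator omitted"},
    {"participant": "P02", "query": 1, "attempt": 1, "label": "schema drawn as table"},
]


def summarize(records):
    """Build one table row per label (characteristic)."""
    # Collect the distinct diagrams in which each label was observed.
    diagrams_by_label = defaultdict(set)
    for r in records:
        diagram = (r["participant"], r["query"], r["attempt"])
        diagrams_by_label[r["label"]].add(diagram)

    table = {}
    for label, diagrams in diagrams_by_label.items():
        table[label] = {
            "participants": len({p for p, _, _ in diagrams}),
            "queries": len({q for _, q, _ in diagrams}),
            "diagrams": len(diagrams),
            "first_attempts": sum(1 for _, _, a in diagrams if a == 1),
            "second_attempts": sum(1 for _, _, a in diagrams if a == 2),
        }
    return table


summary = summarize(observations)
for label, row in summary.items():
    print(label, row)
```

Keeping the raw observations separate from the summary makes it easy to rebuild the table after each pass through the data, as labels are added or merged.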