Abstract
Knowledge based repositories are becoming more and more common. Such repositories tend to be very large, therefore introduce a challenge in managing their operations efficiently. This paper offers a mathematical model to properly manage these repositories. The purpose of the model is to assist in locating a needed artifact from within a given repository. The process of finding an artifact in the repository, is based on two types of links that exist between artifacts: one that is established via feedback from the users and another from the creators of the artifacts. The model also offers a confidence rank function based on grading from the users. All the management operations are processed efficiently.
Introduction
The amount of information surrounding us is growing rapidly in all categories of knowledge, and subsequently repositories are being created. The aim of this paper is to present a general mathematical model assists the management of a repository efficiently.
Repositories exist in almost every field of research, for example: Archaeology [2], Chemistry [7], Numeric Methods [10, 18] etc. The software field is notorious for heavily utilizing repositories: configuration management repositories [19, 2], open source code and testing libraries [1, 12] are of the more commonly found. They are used for various tasks: distributed configuration management [21], change based repositories [17], example repositories [6, 13], developing a verifying compiler [5], code completion [6], reusing code [14] and many more and can be found on SourceForge [22], Google Code Project Hosting [11], Krugle [15], and other publicly accessible websites.
Numeric method repositories play a complementary role as a subset of the more comprehensive software repositories. A numeric repository is itself a software repository that concentrates on performing a numeric algorithm as a functional software program. These repositories can be published in a wide variety of software programming languages, as such, they have an unlimited amount of restrictions or freedom of implementability in other projects. Therefore, assurance and reliability in a numeric repository is a compelling prerequisite to any user looking to apply the numeric program in their own project.
A repository is only useful if a user can easily execute operations on it, the more operations readily available, the more useful the repository. The following list presents five operations each repository must execute as fast as possible [3, 14]:
Find the most compatible artifact for the user’s requirements. Find related artifacts to a certain artifact. Gather information concerning the quality and compatibility of an artifact. Create a set of artifacts that poses a certain attribute. Perform statistical analysis.
The model described below was first introduced in [8]. We generalized the model to conform to all types of repositories. This paper is organized as follows: Section 2 describes the model in general form. Section 5 defines a pairing function which is used in Sections 3 and 4 where it describes the way the model encodes and uses the different links between artifacts. Section 6 recalls the confidence rank function [8], and describes the effects of the links between the tests on it.
A repository containing an immense amount of artifacts can hinder efficiency. Our goal is to maintain efficiency of management operations regardless of the repositories’ size. Our model appends meta-data, informational relationship links to related artifacts, and categorizes each artifact and its relevant use.
The model depicts the repository as a
Unique identifier: Each artifact is assigned a unique identification integer. Confidence rank: Assigned to the artifact by the confidence rank function described in Section 6. Set of undirected links: A set of artifacts that are connected to the current artifact. All connections are bidirectional and further expanded upon in Section 3. Undirected links set size: The size of the set encoded in the row 3. Set of subunits: A set of artifacts that are subunits of the current artifact. These connections induce a hierarchy connection further explained in Section 4. Subunits set size: The size of the set encoded in row 5. “Original” artifact: The identifier number of the artifact the current artifact originated from, namely the unit the current unit is a subunit of. This is closely related to the set of subunits and is expanded upon in Section 4.
Although all artifacts are independent of one-another, there are nonetheless relational links between them. For example:
If there are multiple versions of the same artifact, there is a need to maintain a link between an early version and a revision. This helps the user find any group of versions and choose the version that suites his needs the most. Normally artifacts are not executed alone. Namely, different artifacts tend to be executed together. If this information is accessible it makes it more practical for the user to find all related artifacts.
The process of finding an artifact is based on these links. A user begins with a certain artifact (accessed by its unique identifier or by a text-based search), and by exploring the neighboring links of the artifact, finds other artifacts best fitted for the user needs.
The links are established by the creator of the artifact and by the users who searched for them. The creator of the artifact can link the artifact to other artifacts that are related to it. Users can also establish new relational links for the given artifact after using it. Notice that linking two artifacts due to one usage together can generate many links, therefore, a link is established only after a group of users created it.
The model needs to persist these links. Notice the different links induce a graph. Since the links discussed are bidirectional, the graph created is undirected. The interesting observation is that the graph supplies a lot more information than the binary connection supply. Specifically using one of the known graph traversal algorithms with a certain artifact as the root node, can help find additional artifacts that might be suitable for the user needs. Moreover, the close radius of a certain artifact to another can offer much insight into the artifact itself (an example is provided in Section 6).
As described earlier, the model stores information regarding the artifact separate for each artifact. Therefore, each artifact only needs to store its direct neighbors. This means that even if the repository contains many artifacts i.e. the graph has many nodes, the complexity of storing the information depends only on the number of artifacts the current artifact is linked to, i.e. the degree of the node.
The model stores the neighbors of an artifact using the pairing function that will be described in Section 5. A set of neighbors can be regarded as a tuple, where the order is arbitrary. The model encodes the tuple into an integer using the pairing function. Therefore, the model uses two rows to encode the graph:
Row 3 contains the set of artifacts that are connected to the current artifact. The model encodes the identifiers of this artifacts using the pairing function. Row 4 contains the size of the set encoded in the row 3. The size is mainly for extracting the set from the encoded number.
Obviously the set of neighbors are dynamic and prone to change; for instance, a new linked artifact can be created, a linked artifact might be deleted from the repository, or new links can be created by the user. The storing technique of the model makes the operations dealing with these changes efficient: Let
As mentioned above, the radius of an artifact can provide much insight into its relationships. Extracting the first neighbor of an artifact takes
Smaller code segments are easier to define and easier to verify. An artifact may be comprised of subunits. Some users may need to use only the subunit and not the entire artifact. In this scenario the repository manager can split the unit into different subunits, and create a new artifact for each of the subunit. However, it is expected that the original artifact will also stay in the repository since there are users that need to use the combination of the subunits. Notice a subunit can also be split recursively into a sub-subunits and so on.
Crucial observations can be attained from the method of creation of the subunits: after the new subunits were constructed, a user can come across a new subunit but actually prefer to use the original unit. Therefore, preserving the connections between a subunit and the original unit is paramount. Moreover, a user can find the original unit in the repository, but only needs a single subunit, thus, there is also need to store the connections in the opposite direction (the connections between the unit and the subunits). In summary:
Each artifact An artifact may have zero, two or more artifacts as subunits, therefore producing a directed forest. The root of each tree is an original artifact. A leaf in the tree is an artifact that was not split to subunits (over time this may be changed).
One of the most important aspects of a repository is for a user to locate an optimal artifact, based on their requirements, in an easy and efficient way. The forest described above can help the user with this task. A user may need to traverse a tree in the downward direction in order to find the subunit that fits, or may need to traverse upwards to find a combination of artifacts that suits their requirements. Sometimes there may be need to travel more complex paths that go both up and down in order to also examine different branches of the tree.
Notice once again when trying to encode the forest, an artifact needs to persist information from its immediate neighborhood. A unit needs to contain the identities of its subunits and the unit it originated from (if one exists). However, the main difference between the forest and the graph described in Section 3 is that there are two different kind of neighbors.
The model uses three rows to encode the forest:
Row 7 simply contains the unique identifier of the artifact that the current artifact originated from. Row 5 contains the set of subunits of the current artifact. As before, the model encodes this using a pairing function. Row 6 contains the size of the set of subunits. This is used mainly for applying the pairing function.
This encoding is efficient in many ways: It is reasonable to assume that the number of subunits of a certain artifact is bounded, hence the complexity of finding a subunit is
Users can combine both types of graphs in the search for compatible artifacts. For example, the search can focus on a certain radius of both the artifact and its subunits, or all the nodes in the path from the artifact to the root. The search could also look at the radius of all the artifacts that are subunits of the same unit, and so on.
The links described in Sections 3 and 4 induce graphs (of very different forms). These graphs need to be stored in the model, however for this to be done without consuming to much space the graphs are encoded using a paring function: Each item has a unique number and all the item’s numbers will be processed, using the pairing function, into a single number that represents the encode of all this connected items. The advantage of the pairing function is the bidirectional ability to encode and invert. The function can encode all the neighbors in the graph into one number and can also extract the neighbors in the graph from an encoded number.
The pairing function maps
If the value
Define
Notice:
Since
We will get that:
So:
Now we can easily compute
Obviously the complexity of calculating both the pairing function and its inverse is
In order to extract all the values of the tuple from the encoded number, there is a need to know only the function value and the size of the tuple. The numbers are extracted recursively with the process above, such that when
In the next iteration,
Confidence rank
The confidence rank function used in row 3 was described in length in [8]. The confidence rank function is used to evaluate the quality and generic aspect of an artifact. This section emphasizes the correlation between other rows and the confidence rank.
The “cold start” problem arises when ranking a new item. One of the techniques to overcome this problem is to force the user to grade the item upon first use. The model draws a border line between cold-start during the information state and the knowledge state (after gathering sufficient information about the item).
Due to that, determining the quality of the item cannot be done before it was used a number of times. We call the first number of times the “learning time”. After the learning time, there is enough experience to properly examine the quality of the artifact. The length of the learning time is represented by the learning time constant (discussed in the upcoming section).
Obviously the number of times an artifact was referenced plays a role also after the learning time period and the number of times an artifact was referenced should be taken into account in when determining the confidence rank.
The best feedback concerning the quality of the artifact comes from users who referenced it. The model contains a grading process:
An initial grade of zero is assigned to the artifact at creation. Each user reference requires that user to provide an additional grade. The grading scale is binary: if the user grades positively, the grade increases by one, a negatively marked grade will decrease the overall grade by one.
In order for an artifact to be considered “good” a large portion of the grading must be positive. The portion of positive grading must be larger than some previously defined threshold. Obviously this threshold must be larger than 1, however the exact threshold may vary from artifact to artifact. Thus the confidence rank function depends on four arguments:
The number of times the artifact was referenced, denoted by The learning time constant, denoted by The overall grade of the artifact, denoted by The positive grading threshold, denoted by
The function is denoted as follows:
There are a few different results that can be obtained:
The portion of positive grading is larger than the threshold, the value of the function is positive. Moreover, as The portion of positive grading is smaller than the threshold, the value of the function is negative. Notice also that as n grows the confidence rank is still negative, however its absolute value grows.
Choosing a learning time constant for an artifact is a complicated task. However, the connections between artifacts described in Sections 3 and 4, may aid in the decision making: if the artifact is connected to another test it makes sense to give the new artifact a similar learning time constant. Moreover, the creator can explore the close radius of the artifact and choose the learning time constant to be a function of the constants of the artifacts in the radius explored.
The last paragraph applies also for the positive grading threshold. Specifically, a creator can choose the threshold by examining the thresholds in the close neighborhood of the artifact.
Notice also that if artifact
The last discussion does not apply to the grade of the artifact; in other words, the grade of the artifact must begin from zero even if it was created as a subunit. This is a direct result from the fact that the grade is binary, even if the artifact got positive grading there is no way of knowing if the specific subunit was sufficient or the combination of subunits was overall sufficient.
This obviously also suggests that the creator should not decrease the positive grading threshold only because the artifact was created as a subunit.
We conclude that establishing the arguments for the confidence rank function can benefit greatly from the connections stored in the other rows. Moreover, the way the model stores this information makes the described operations easy and efficient.
Numerical repositories need to address issues that general repositories do not, such as:
Coping with different versions of code. Should the repository store the different versions? If so how should they be linked? Subunits: units of code that are subunits of other code units. How should the repository mange this situation? Can it be used to benefit the implementation of the operation?
Managing a numerical repository was addressed before, different search and retrieving approaches have been suggested for these repositories, all of these are text based: codes are found automatically using heuristically due to structural similarity [14], applying the
Our model manages the repository via feedback from the users and creators of the artifacts. The search is not text based it relies on connections between the artifacts. The connections are established also due to feedback from the users. Our model also manages different versions of the code. It also contains a confidence rank function that is based on grading from the user. The paper describes a mathematical model for managing a software repository. This models enables managing a large repository while still allowing the operations to be efficient. The model stores the information in a matrix, where each column represents an artifact in the repository.
There exist links between different artifacts in the repository. The model defines two types of links: non-directed and directed (hierarchy links). The non-directed links are created by the users and the creators of the artifact. Finding artifacts in the repository is done by traversing the graph induced by the two types of links. The graphs are encoded using a pairing function. The model also offers a confidence rank function for each artifact. The function deals with the cold start problem via a user’s grading process.
The model can be developed further by adding attributes to the induced graphs. Also the confidence rank function can be further researched to fit to other areas as well.
