Abstract
A substantial part of architectural and urban design involves processing of compositional interdependencies and contexts. This article attempts to isolate the problem of spatial composition from the broader category of synthetic image processing. The capacity of deep convolutional neural networks for recognition and utilization of complex compositional principles has been demonstrated and evaluated under three scenarios varying in scope and approach. The proposed method reaches 95.1%–98.5% efficiency in the generation of context-fitting spatial composition. The technique can be applied for the extraction of compositional principles from the architectural, urban, or artistic contexts and may facilitate the design-related decision making by complementing the required expert analysis.
Introduction
Reason for research
The subject of spatial composition commonly appears in architecture, art, and urban planning. Nearly every design decision requires a thorough, expert analysis of the existing contexts and the compositional interdependencies they present. Generally, people are fairly competent at processing basic spatial relations and principles. Such basic ordering principles include, for instance, axiality, symmetry, rhythm, or a certain hierarchy of the elements. 1
A human presented with a few simple images connected by a common compositional logic will easily recognize the underlying rule and, by analogy, will be able to generate new images with regard to the identified logic (Figure 1).
Figure 1. Example of images connected by a basic compositional ordering principle: axial symmetry.
With training and experience, people can also develop the ability to manage complex principles in which the composition is structured upon several basic ordering principles simultaneously. Correctly interpreted principles allow for the creation of new spatial compositions, or parts thereof, in a context-fitting manner. In real-world conditions, the compositional contexts of given design problems often become so complex that an intuitive, non-expert analysis is no longer sufficient. Thus, the need for an expert analysis follows. Currently, numerous applications of machine learning in architectural design are being researched, developed, and discussed. 2
In the present research, the authors analyze the possibility of separating the processing of spatial composition and complex ordering principles from the broader field of machine learning image processing. The resulting method demonstrates the efficiency of deep convolutional neural networks (CNNs) 3–5 in the identification of complex compositional principles in the presented contexts. The efficiency was also assessed with regard to the utilization of these principles for synthetic generation of new spatial compositions and fragments thereof (Figure 2).
Figure 2. Example of images connected by a complex compositional principle: number of elements (three islands), axiality (all islands are collinear), and hierarchy (the islands are ordered from the smallest to the largest).
State of the art
The present research falls into the field of information architecture, as understood by Schmitt, 6 Saggio, 7 and Słyk. 8 According to its classic interpretations, the main technique of information architecture lies in programming architectural processes, as opposed to proposing any final design solutions. Usually, this process programming is understood as a direct definition of the target algorithm parameters. 8 However, as the field of machine learning matured, researchers began to notice a slight paradigm shift within CAAD. John Gero, among other researchers, noticed that the parameters of design processes are usually too complex for limited human abilities and that machine learning could potentially offer a remedy in this case. 9 Indeed, the deep learning revolution initiated in 2012 by AlexNet 5 and its performance at the ImageNet Large Scale Visual Recognition Challenge showed that it is no longer necessary to craft process parameters manually. Provided that sufficient datasets are available, a CAAD designer need only define the hyperparameters of a deep neural network and can delegate the complex task of parameter optimization to the network itself.
Analysis of the current state of the art in machine learning suggests that contemporary deep neural network models are already competent at image classification (e.g., AlexNet), 5 synthetic image generation (e.g., DALL-E), 10 and completion of obscured image fragments (e.g., Image GPT). 11 These tasks require practical skills in processing compositional principles and a certain “understanding” of the spatial composition as a whole.
In recent years convolutional neural networks were among the most utilized deep learning methods for visual and image-related tasks. The implementation of CNNs in architectural design was either holistic (direct application of entire existing systems in generation of global solutions) or atomistic (implementation of CNNs in investigation of individual design process components). Among the holistic applications, Google Deep Dream 12 and neural style transfer 13 have been utilized to generate perspective views of architectural scenes. 14 AlexNet 5 and GoogLeNet 15 were applied in the extraction of ImageNet classes 16 from the spherical images captured in a 3D model of Osaka City. 17 Generative Adversarial CNNs 18 were used to generate new, Islamic, geometric patterns based on provided examples 19 and a CNN was trained to distinguish architectural plans from sections. 20 The more specific, atomistic approaches include pixel-level classification 21 of architectural elements represented in architectural drawings, 22 generation of floor plan connectivity diagrams for the early phases of conceptual design, 23 and extraction and classification of the tectonic space types 24 based on the isovist representations. 25
Premises and research goals
So far, the problem of synthetic image processing was usually treated holistically. The capabilities of deep neural networks in the processing of architectural, compositional contexts have neither been isolated nor sufficiently researched in separation from the remainder of the field of synthetic image processing. The more specific, atomistic applications of CNNs in architectural design did not intrinsically concern the compositional context processing.
As the aforementioned examples from the related fields indicate, most deep neural network models are capable of processing vast quantities of input data represented in numerous dimensions. Essentially, they can be used in the processing of 2D and 3D+ spatial compositions, as well as more complex data representations. Due to the way modern deep neural networks learn (i.e., through gradient descent and backpropagation), the algorithm gains the ability to solve problems without the need for explicit definition of the parameters by the designer. 26 Providing an adequately large set of positive and/or negative examples is typically sufficient for an appropriate network model to independently work out an effective algorithm. Especially nowadays, when the idea of the Smart City is becoming a reality, vast quantities of data can be mined from sensors constantly probing our urban and architectural compositions. These data can be found in satellite imagery, GIS and BIM databases, CCTV monitoring, street maps, or even private smart cars and smartphones. These sources produce an immense amount of information, which in its raw form could not be processed directly by a human. It should be possible to design a deep neural network capable of recognizing complex compositional premises present in a dataset of spatial arrangements. The premises extracted by the network could then be applied to generate new spatial compositions which respect the compositional logic provided by the contexts.
The objective of the study was to separate the problem of compositional principles processing from the general field of synthetic image generation. The authors also attempted to demonstrate and evaluate capabilities of deep convolutional neural networks in learning complex compositional principles and using these principles in generation of new, context-fitting spatial arrangements that match the logic represented in the training set. The conceivable tools utilizing the developed methodology could potentially be used in architecture and urban design to complement and inspire spatial composition decision-making during the design process, as envisioned in the Applicability section.
Methodology
Studied complex compositional ordering principle
The complex compositional principle used in the study consisted of the following conditions:
1. On the monochromatic image, there are three white islands presented against a black background.
2. The islands differ in size.
3. The islands are arranged along an axis.
4. The islands are ordered with regard to their size.
The fulfillment of each of these conditions could be conclusively determined with either qualitative or quantitative methods. With an appropriate number of test cases, the algorithm’s performance could be estimated statistically under each of the principles. The performance demonstrated and measured on the proposed, controlled example could hint toward the expected effectiveness of the algorithm on real-life compositional contexts, which would be difficult to test and assess quantitatively (Figure 3).
Figure 3. Basic compositional ordering principles constituting the studied complex principle.
Datasets
For the design, tuning, training, and evaluation of the neural network, a dataset consisting of the following subsets was used:
1. Validation set: used for design of the network model and tuning of hyperparameters in order to achieve the highest statistical efficiency. The performance on the validation set was also compared to the performance on the training set in order to diagnose possible overfitting of the parameters to the training set. Slight differences in distribution of the sets were made in order to further prevent overfitting (an overfit network would misinterpret the random noise present in small training sets as relevant relationships between the training examples). 28
2. Training set: a set of spatial compositions that represent compositional logic of the studied complex ordering principle. The set was used for optimization of the parameters aimed at extraction of the principles and utilization of these principles for generation of new compositions.
3. Test set: used for the final evaluation of the algorithm at parameters that provide the highest efficiency on the validation set. The performance on the test set compared to the performance on the validation set was used to identify potential overfitting of the algorithm to hand-tuned hyperparameters of the model. The set shares the same distribution differences as the validation set.
4. Test set that shares the training distribution: used for the evaluation of efficiency of the algorithm without additional changes to the distribution between the training and testing sets. The performance on this set compared to the performance on the regular test set allowed for evaluation of the impact exerted by the introduced distribution differences on the algorithm performance.
The generation of the training pairs was unsupervised and non-contrastive. The obscured composition fragments from the datasets were kept in the memory as ground truth positive examples for further comparison with the neural network outputs.
Distribution differences
In order to eliminate the impact of possible overfitting on the final result, slight differences were introduced in addition to basic random diversity of the sets. Most notably, distribution of the training set was differentiated from the validation and test sets. Thanks to this solution, the network had no possibility to blindly copy spatial compositions and their fragments from the training set to the test examples. The trained network was unable to simply learn to represent the entirety of the training set in the latent space parameters (representation of the training set in the hidden layers of the network) and reuse the learned compositions, since that would constrain the performance on the validation set.
The size of the islands in the training set and the test set that shares the training distribution differs from the size of the islands in the validation and test sets. An island which consisted of a specific number of pixels could not appear simultaneously in both set distributions. This condition also ensured that the shape of the islands could not be shared between the sets. Furthermore, the possible orientations of the main composition axes were differentiated: the allowed angles of the axes were restricted, with different allowed ranges for the two distributions (Figure 4).
Figure 4. Allowed composition axes angles in the training set and the test set sharing the training distribution.
Allowed composition axes angles in the validation and test sets.
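The size constraint described above can be illustrated with a minimal sketch. The two size pools below are hypothetical (the text does not list the actual pixel counts used in the study); the only property carried over from the description is that no island size may occur in both distributions.

```python
import random

# Hypothetical pools of island sizes (in pixels); the study's actual
# values are not given in the text. The two pools share no size.
TRAIN_POOL = [1, 3, 5, 7, 9, 11]
EVAL_POOL = [2, 4, 6, 8, 10, 12]


def sample_sizes(pool, rng):
    """Draw three distinct island sizes, ordered smallest to largest."""
    return tuple(sorted(rng.sample(pool, 3)))


def shared_sizes(train_triples, eval_triples):
    """Return the island sizes that occur in both distributions."""
    train = {size for triple in train_triples for size in triple}
    evals = {size for triple in eval_triples for size in triple}
    return train & evals


rng = random.Random(0)
train_triples = [sample_sizes(TRAIN_POOL, rng) for _ in range(1000)]
eval_triples = [sample_sizes(EVAL_POOL, rng) for _ in range(100)]
overlap = shared_sizes(train_triples, eval_triples)  # empty by construction
```

A check of this kind could be run on generated datasets to confirm that a network cannot succeed on the validation and test sets merely by reproducing islands memorized from the training set.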

Generating the datasets
In order to create the datasets, a separate non-neural algorithm was used. The algorithm generated random spatial compositions that fulfilled the required compositional ordering principles and assumed distribution differences between the sets. On each image example, the algorithm would generate three islands of different size arranged along an axis and ordered from the smallest to the largest one. The size of the islands was determined based on the number of grouped, adjacent pixels (corner contact was also assumed as sufficient).
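The island size definition above (grouped, adjacent pixels, with corner contact counting as adjacency) corresponds to connected-component labeling with 8-connectivity. The following is a minimal sketch of such a routine, not the authors' generator; the function name and the list-of-lists image encoding (0 for black, 1 for white) are assumptions made for illustration.

```python
from collections import deque


def label_islands(image):
    """Return a list of islands, each a list of (row, col) white pixels.

    Pixels are grouped with 8-connectivity, so two white pixels that
    touch only at a corner belong to the same island.
    """
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    islands = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            queue, island = deque([(r, c)]), []
            seen[r][c] = True
            while queue:  # breadth-first flood fill of one island
                y, x = queue.popleft()
                island.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            islands.append(island)
    return islands


example = [
    [1, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1],
]
sizes = sorted(len(island) for island in label_islands(example))
```

In this example the two diagonally touching pixels in the top-left corner form one island of size 2, which is the effect of treating corner contact as sufficient.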
Mathematically, the axiality principle could hardly ever be fulfilled with absolute precision under the established study conditions, so a certain margin of deviation in axiality needed to be defined. The acceptable collinearity margin was defined by means of an inequality.
Tests of the axiality principle on the images generated by the island generator.
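The inequality itself is not reproduced in this text, so the sketch below only illustrates one plausible way a collinearity margin could be checked. The deviation measure (the height of the triangle spanned by the three island centroids over its longest side) and the margin value are assumptions, not the authors' definition.

```python
import math


def centroid(pixels):
    """Mean (row, col) position of an island given as (row, col) pixels."""
    n = len(pixels)
    return (sum(p[0] for p in pixels) / n, sum(p[1] for p in pixels) / n)


def axial_deviation(p, q, r):
    """Height of triangle pqr over its longest side (0 if collinear)."""
    doubled_area = abs((q[0] - p[0]) * (r[1] - p[1])
                       - (q[1] - p[1]) * (r[0] - p[0]))
    longest = max(math.dist(p, q), math.dist(q, r), math.dist(p, r))
    return doubled_area / longest if longest > 0 else 0.0


def is_axial(centroids, margin):
    """True if the three centroids deviate from a line by at most margin."""
    return axial_deviation(*centroids) <= margin


exact = axial_deviation((0, 0), (2, 2), (4, 4))  # perfectly collinear
near = axial_deviation((0, 0), (2, 3), (4, 4))   # middle point off-axis
```

With a hypothetical margin of one pixel, both example triples would be accepted as axial; with a margin of half a pixel, only the first would.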
Tested scenarios
The study was divided into three scenarios of varying scope, complexity, and difficulty. The division allowed for evaluation of neural network performance under differently formulated, but comparable tasks (Figure 7):
1. Generating the entire composition by the neural network based on the input composition with a covered fragment (with expected reconstruction of unobscured fragments).
2. Generating the covered fragment of the composition on its own (without reconstruction of unobscured fragments).
3. Generating a single, covered pixel of the composition (the most basic compositional decision).
Figure 7. The three scenarios tested in the study.

Each of the scenarios was managed by a separate deep convolutional neural network. Structurally, the network models were partially inspired by AlexNet. 5 The networks consisted of two parts connected with each other:
1. A convolutional neural network which, in simplified terms, was responsible for the extraction of the low-level features and compositional principles from the input images.
2. A fully-connected, feedforward neural network which, in simplified terms, was responsible for the higher-level processing of the extracted features and compositional principles, and for the context-fitting generation of the missing composition fragments (Figure 8).
Figure 8. Statistical determination of the normalized pixel generation threshold value in the first scenario. The value of 0.5 provided the best performance.

Following the activations of the final layer, a pixel generation threshold was applied. The threshold was determined statistically, on the basis of network performance on the validation set, measured with the evaluation algorithm (Figure 9).
Figure 9. Statistical determination of the normalized pixel generation threshold value in the second scenario. The value of 0.4 provided the best performance.
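The threshold selection described above can be sketched as a simple sweep: binarize the validation outputs at each candidate threshold, score them, and keep the best candidate. The function names, the candidate grid, and the toy scoring rule below are assumptions for illustration; in the study the score came from the evaluation algorithm.

```python
def binarize(activations, threshold):
    """Turn a 2D map of activations into white (1) and black (0) pixels."""
    return [[1 if a >= threshold else 0 for a in row] for row in activations]


def pick_threshold(outputs, score_fn, candidates):
    """Return (threshold, mean score) with the highest mean score.

    `score_fn` maps one binarized image to 1 (accepted) or 0 (rejected);
    ties are resolved in favor of the earliest candidate.
    """
    best = None
    for t in candidates:
        score = sum(score_fn(binarize(o, t)) for o in outputs) / len(outputs)
        if best is None or score > best[1]:
            best = (t, score)
    return best


def exactly_one_white(image):
    """Toy stand-in for the evaluation algorithm (hypothetical rule)."""
    return 1 if sum(map(sum, image)) == 1 else 0


validation_outputs = [
    [[0.55, 0.15], [0.25, 0.05]],
    [[0.90, 0.42], [0.15, 0.00]],
]
candidates = [i / 10 for i in range(1, 10)]
best = pick_threshold(validation_outputs, exactly_one_white, candidates)
```

Because the threshold is applied after the network, it can be tuned on the validation set without retraining, which is consistent with its treatment as a hyperparameter.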
Methodology for evaluation of results
The results were evaluated qualitatively with a custom-made, non-neural parametric algorithm which checked the fulfillment of the basic compositional principles in the following sequence:
1. Does the generated image have three islands?
2. If so, is the composition axial?
3. If so, are the islands ordered from the smallest to the largest one?
The principles were checked hierarchically. If the number of islands generated by the neural network was lower or higher than three, the composition was not checked for axiality and order. Similarly, if the composition was not axial, the order of the islands was not checked. The evaluating algorithm counted the positive examples under each of the principles for the entire validation and test sets. Finally, the algorithm yielded the quantitative percentage of the correctly generated spatial compositions. Importantly, the evaluating algorithm was not used while training the networks, due to the utilized network optimization technique (backward propagation) and the inability to define partial derivatives of the calculated score with regard to individual network parameters. 26 The algorithm was used only for the final evaluation of the fitted model and for tuning of network hyperparameters between the training sessions. For the training of the networks in the first two scenarios, Huber loss function was used, 27 whereas in the final scenario the network was trained with the use of binary cross-entropy loss function.
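The hierarchical counting described above can be sketched as follows. The sketch assumes that the three per-image checks have already been computed by separate routines and are passed in as tuples; the function name and the result keys are hypothetical.

```python
def hierarchical_tally(checks):
    """Compute cumulative pass percentages for the three principles.

    `checks` holds one (island_count, is_axial, is_ordered) tuple per
    generated image. A principle is only counted when every earlier
    principle in the sequence has been fulfilled.
    """
    total = three = axial = ordered = 0
    for island_count, is_axial, is_ordered in checks:
        total += 1
        if island_count != 3:
            continue  # axiality and order are not checked
        three += 1
        if not is_axial:
            continue  # order is not checked
        axial += 1
        if is_ordered:
            ordered += 1
    if total == 0:
        raise ValueError("no images to evaluate")
    return {
        "three_islands": 100.0 * three / total,
        "axial": 100.0 * axial / total,
        "ordered": 100.0 * ordered / total,
    }


result = hierarchical_tally([
    (3, True, True),
    (3, True, False),
    (3, False, True),   # not axial, so its order is ignored
    (2, True, True),    # wrong island count, so nothing else is checked
])
```

A score of this kind is a count over discrete checks and therefore offers no gradient with respect to the network parameters, which is why it could serve for evaluation and hyperparameter tuning but not as a training loss.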
Experiments and results
Generating the whole composition
In the first scenario, the capabilities of a deep convolutional neural network were tested in terms of generation of the entire image based on the input spatial composition with an obscured fragment. The training set consisted of 7000 images, whereas the test sets and the validation set consisted of 700 images each. The images measured 16 by 16 pixels. Such a small size was already sufficient for encoding of the tested compositional principles, as it allowed for satisfactory randomization of the datasets and was relatively fast to process. Each of the training images was copied 100 times and each time a gray rectangle of a varying size and position was generated to obscure a part of the composition. The neural network had to fill in the missing composition fragment and reconstruct the remaining part of the image. The network was trained for three epochs.
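The augmentation step described above (100 masked copies per training image) can be sketched as below. The gray value of 0.5 and the uniform sampling of rectangle size and position are assumptions; the text does not state how the rectangles were drawn.

```python
import random

GRAY = 0.5  # assumed encoding of the gray mask between black (0) and white (1)


def mask_random_rectangle(image, rng):
    """Return a copy of a square image with a random gray rectangle on it."""
    size = len(image)
    height = rng.randint(1, size)
    width = rng.randint(1, size)
    top = rng.randint(0, size - height)
    left = rng.randint(0, size - width)
    masked = [row[:] for row in image]
    for r in range(top, top + height):
        for c in range(left, left + width):
            masked[r][c] = GRAY
    return masked


def augment(image, copies, rng):
    """Build (masked input, full target) training pairs for one image."""
    return [(mask_random_rectangle(image, rng), image) for _ in range(copies)]


rng = random.Random(0)
image = [[0] * 16 for _ in range(16)]
image[4][4] = 1
pairs = augment(image, copies=100, rng=rng)
```

Each pair keeps the unmasked image as the target, which matches the first scenario's requirement that the network both fill the obscured fragment and reconstruct the visible remainder.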
Network model in the first scenario
In the first scenario, a deep convolutional network was used with a fully-connected, feedforward final section. The input images were represented by 16 × 16 × 1 arrays. The output images were also 16 × 16 × 1 arrays. As a regularizing technique, a dropout of 10% of the neurons in the fully-connected layers was applied 28 (Figure 10).
Figure 10. The model of a deep convolutional neural network used to generate the entire composition in the first scenario.
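As an illustration of the regularization mentioned above, the following minimal sketch applies dropout at a rate of 10% to a vector of activations. It uses the "inverted" variant (surviving activations are rescaled during training), which is an assumption; the text does not specify the variant or the framework that was used.

```python
import random


def dropout(activations, rate, rng, training=True):
    """Inverted dropout on a flat list of activations.

    During training each unit is zeroed with probability `rate` and the
    survivors are scaled by 1 / (1 - rate); at inference the input is
    returned unchanged.
    """
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)
hidden = [1.0] * 10000
dropped = dropout(hidden, rate=0.1, rng=rng)
zero_fraction = dropped.count(0.0) / len(dropped)  # close to 0.1
```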
First scenario results
As expected, with an increase in complexity of the tested compositional principle, the effectiveness of the neural network decreased. The effectiveness measured on the test sets was as follows (Figure 11):
Three islands: 51.9% on the test set and 57.9% on the test set that shared the training distribution.
Three islands, along an axis: 42.9% on the test set and 51.1% on the test set that shared the training distribution.
Three islands, along an axis, in order: 22.7% on the test set and 39.3% on the test set that shared the training distribution.
Figure 11. Effectiveness of the network in generating the entire images fulfilling respective principles. In addition to the dropout, the early-stopping regularization technique was applied to further reduce overfitting.
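Because the principles were checked hierarchically, each reported percentage is cumulative: it counts the images that passed a given check and all checks before it. Dividing consecutive figures gives the pass rate of each individual stage. The short sketch below performs this arithmetic on the reported values; the function name and result keys are hypothetical, and the derived rates are an additional reading of the data rather than figures reported by the study.

```python
def conditional_rates(three, axial, ordered):
    """Convert cumulative pass rates (in %) into per-stage pass rates."""
    return {
        "three_islands": three,
        "axial_given_three": 100.0 * axial / three,
        "ordered_given_axial": 100.0 * ordered / axial,
    }


test_set = conditional_rates(51.9, 42.9, 22.7)
training_distribution = conditional_rates(57.9, 51.1, 39.3)
```

On the test set, roughly 83% of the three-island images were also axial, but only about 53% of those axial images were correctly ordered; on the test set sharing the training distribution the corresponding rates are about 88% and 77%. Read this way, the ordering principle appears to be the stage most affected by the distribution differences.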
In the first tested scenario, the network was not always successful at generating the correct solutions. The first, basic requirement of the needed number of islands was respected relatively well. However, as complexity increased, the performance decreased. Approximately every fourth image fulfilled all the necessary requirements (Figure 12. Ex. 1–2). In raw activations of the final layer, preceding the threshold, the correct pattern was often noticeable. Yet, individual activations were too weak to trigger the pixel generation (Figure 12. Ex. 3). In the first scenario, the network not only had to fill the obscured part of the composition, but also to reconstruct the visible fragments of the input image. As a result, the algorithm almost always changed the compositional contexts, which further lowered the performance (Figure 12. Ex. 4). This problem was overcome in the second scenario discussed later. In examples in which almost the entire image was obscured, the network failed to generate activations strong enough to trigger pixel generation. In these cases, the activation pattern resembled averaged values of pixels in the entire training set (Figure 12. Ex. 5).
Figure 12. Examples of spatial compositions generated by the deep convolutional neural network based on the contexts provided in the test set. From top to bottom: input image with an obscured fragment; activations of the final layer of the network; final generated compositions. Choice of positive and negative cases.
Generating a fragment of the composition
Due to the changes made by the network to the encountered compositional contexts in the first scenario, in the second study only the obscured fragment of the composition was generated by the network. The training set consisted of 7000 images, the validation and test sets had 700 images each. As in the first scenario, input images measured 16 × 16 pixels each. Each of the training images was copied 36 times and a fragment of the image was obscured with a gray square of random position whose dimensions equaled 11 × 11 pixels. The network was only expected to generate the missing part of the spatial composition. The training took 75 epochs. Early-stopping was applied to prevent overfitting.
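The construction of training pairs in the second scenario can be sketched as below. An 11 × 11 square fits into a 16 × 16 image at (16 − 11 + 1)² = 36 distinct positions, which coincides with the 36 copies made of each image; the text describes the position as random, so whether every position was used exactly once is not stated. The gray value of 0.5 and the function names are assumptions.

```python
SIZE, PATCH, GRAY = 16, 11, 0.5  # GRAY is an assumed mask encoding


def patch_positions(size=SIZE, patch=PATCH):
    """All (top, left) corners at which the patch fits inside the image."""
    span = size - patch + 1
    return [(top, left) for top in range(span) for left in range(span)]


def make_pair(image, top, left, patch=PATCH):
    """Return (masked input, patch-sized target) for one mask position."""
    target = [row[left:left + patch] for row in image[top:top + patch]]
    masked = [row[:] for row in image]
    for r in range(top, top + patch):
        for c in range(left, left + patch):
            masked[r][c] = GRAY
    return masked, target


image = [[0] * SIZE for _ in range(SIZE)]
image[5][5] = 1
positions = patch_positions()
masked, target = make_pair(image, *positions[0])
```

Unlike in the first scenario, the target here is only the obscured patch, so the network output cannot alter the visible context.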
Model of the network in the second scenario
In the second scenario, a model similar to the one used in the first scenario was applied. The network was provided with an initial convolutional section and an ending, fully-connected, feedforward section. Input images had a form of 16 × 16 × 1 arrays, whereas output images had a form of 11 × 11 × 1 arrays. Dropout of 10% of the neurons in the fully-connected layers was used (Figure 13).
Figure 13. The model of a deep convolutional neural network used to generate fragments of the compositions in the second scenario.
Second scenario results
As in the first scenario, the performance dropped with an increase in complexity of the tested compositional principles. It is difficult to state unambiguously which of the scenarios was more challenging for the network to solve. Statistically, however, in the second scenario the network had to generate larger portions of the original composition than in the first scenario. The fragments tested in this scenario always measured 121 pixels, whereas in the first scenario the average size of the obscured fragment was 81 pixels. Even with the larger composition fragments to generate, the results were marginally better than those achieved in the first scenario.
Three islands: 52.9% on the test set and 64.6% on the test set that shared the training distribution.
Three islands, along an axis: 41.6% on the test set and 57.6% on the test set that shared the training distribution.
Three islands, along an axis, in order: 22.8% on the test set and 45.7% on the test set that shared the training distribution (Figure 14).
Figure 14. Effectiveness of the network in generating fragments of the images that fulfilled respective complex principles. In addition to the dropout, the early-stopping regularization technique was used to further reduce overfitting.
As the problem of the network’s interference with the original, unobscured composition parts was eliminated, the compositional principles learned by the algorithm were clearer to interpret. In the majority of cases, the number-of-elements principle was followed. Also, performance measured on the test set that shared the training distribution was substantially higher than in the first scenario. Contrary to the previous scenario, in the cases when the contexts that remained in the unobscured parts of the image were very limited (Figure 15. Ex. 5), the network managed to generate the correct solutions.
Figure 15. Examples of fragments of the spatial compositions generated by the deep convolutional neural network based on contexts provided in the test set. From top to bottom: input image with the activations of the last layer in place of the obscured area; input image with the final generated composition fragments in place of the obscured area. Choice of positive and negative cases.
The simplest compositional decision—generating a single pixel of the composition
In the final scenario, the effectiveness of the network was tested in the most basic compositional decision-making, that is, a single pixel generation. In order to increase the weight of a single pixel on the composition as a whole, the size of the processed images was lowered. The reduction of the image size to 8 × 8 pixels allowed a single pixel to affect the axiality principle. In many cases, a single pixel on a 16 × 16 image was unable to force the composition to exceed the collinearity margin defined in the methodology section. With 8 × 8 image size, a single pixel could also break the number-of-elements principle (a single-pixel-big island or a pixel joining two islands together), as well as the ordering principle (a misplaced pixel could lead to two islands that shared the same size). The training set consisted of 7000 images. The validation and test sets were 700 images each. Each of the training images was copied 64 times and each of the pixels was obscured one at a time.
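The third scenario's training pairs can be sketched as follows: each 8 × 8 image yields 64 pairs, one per pixel, with that pixel hidden in the input and its true value kept as the target. The gray value of 0.5 used to mark the hidden pixel and the function name are assumptions.

```python
GRAY = 0.5  # assumed encoding of the obscured pixel


def single_pixel_pairs(image):
    """Return one (masked image, true pixel value) pair per pixel."""
    pairs = []
    for r, row in enumerate(image):
        for c, value in enumerate(row):
            masked = [line[:] for line in image]
            masked[r][c] = GRAY
            pairs.append((masked, value))
    return pairs


image = [[0] * 8 for _ in range(8)]
for r, c in [(0, 0), (3, 3), (3, 4), (6, 6), (6, 7), (7, 6)]:
    image[r][c] = 1
pairs = single_pixel_pairs(image)
```

Since each target is a single binary value, this formulation is a per-pixel binary classification, which is consistent with the use of the binary cross-entropy loss in this scenario.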
Model of the network in the third scenario
In the final scenario, a deep convolutional network was used as well. The input images formed 8 × 8 × 1 arrays, whereas the pixels were output as a single, floating-point value. No dropout was applied (Figure 16).
Figure 16. The model of a deep convolutional neural network used to generate the single pixel of the composition in the third scenario.
Third scenario results
In the most basic decision-making, the network achieved a very high efficiency with regard to the entire complex compositional principle. In most of the tested images, the network would correctly fill in every one of the 64 pixels.
Three islands: 97.9% on the test set and 98.5% on the test set that shared the training distribution.
Three islands, along an axis: 96.7% on the test set and 96.5% on the test set that shared the training distribution.
Three islands, along an axis, in order: 95.1% on the test set and 96.2% on the test set that shared the training distribution (Figure 17).
Figure 17. Effectiveness of the network in generating single pixels that fulfilled respective complex principles. The baseline performances for randomly chosen pixel values were the following: 74.9% for the number-of-elements principle, 71.0% for the number-of-elements and axiality principles, and 69.0% for the number-of-elements, axiality, and ordering principles.
In nearly every case, the network correctly filled the missing pixels in the composition (Figure 18. Ex. 1–3). As the activation patterns show, the network consistently avoided generating pixels between the islands if doing so would result in the coalescing of the islands (Figure 18. Ex. 1–4). The network also strongly avoided generating white pixels in the middle of the black void, unless there were only two other islands present in the composition. The continuity of thin, one-pixel-thick islands was preserved in most of the cases (Figure 18. Ex. 1, 3). The rare mistakes were mostly caused by joining the islands together (Figure 18. Ex. 5) and by breaking the continuity of the thin islands (Figure 18. Ex. 4).
Figure 18. Examples of the network’s results based on the test set. Each of the pixels was covered and generated separately. From top to bottom: input composition; composite image made out of 64 activations for each of the pixels; color map superimposed on the input image showing the decisions made by the network for each one of the pixels (red means that the generated pixel would be white, blue means that the generated pixel would be black). Choice of positive and negative cases.
Edge cases
In order to complement the quantitative performance assessment, an additional, qualitative analysis of hand-drawn examples was conducted. The examples consisted of edge cases designed to test the limits of the studied algorithm. When the smallest, one-pixel island was covered and there were only two islands present in the composition, the network would correctly decide that the tested pixel must be generated in white (Figure 19. Ex. 1–3). In the cases in which the pixel could impact the hierarchical order of the islands, the network would usually follow the principle and refrain from generating the uncertain pixels (Figure 19. Ex. 2 and the small island in Ex. 3), but in a few rare cases the principle was broken (Figure 19. The medium island in Ex. 3). In the case of unusually shaped, but valid compositions not represented in the training set, the network would usually produce correct solutions (Figure 19. Ex. 4).
Figure 19. The results for the hand-drawn, edge-case examples. From top to bottom: input composition; composite image made out of 64 activations for each of the pixels; color map superimposed on the input image showing the decisions of the network for each of the pixels (red means that the generated pixel would be white, blue means that the generated pixel would be black).
Applicability
Beyond the theoretical demonstration, similar, more advanced systems could be applied to real-life compositional problems. While the possible application range seems wide, some direct use cases deserve to be noted.
In urban design, a similar method could be used for extraction of the formal compositional principles represented by the desired types of urban development. The examples of the positive contexts, presented in a machine-readable format, could be fed into a neural network model responsible for the extraction of the complex compositional principles present in the data. The extracted principles could then be applied by the algorithm to a different urban context, complementing the design process and providing a starting point for the subsequent decision-making. To demonstrate the idea, a more complex version of the second scenario algorithm was trained on the fragments of the figure-ground diagram of Warsaw downtown. The trained algorithm was then used to patch the missing plots in the Warsaw suburbia, currently under rapid development. The compositional principles present in the downtown figure-ground diagram and extracted by the neural network were applied to the suburban contexts (Figure 20).
Figure 20. Examples of the urban structures proposed by the neural network (within the red box). The network was trained on the figure-ground diagram of downtown Warsaw and was applied to diagrams of rapidly developing suburban settlements.
Even at the resolution limited by the network’s complexity, some basic compositional principles present in the contexts seemed to be followed. The network usually respected the urban density, axiality, and general scale of the buildings.
The demonstrated results suggest that, within the sparse contexts present in the figure-ground diagrams and with limited accessible computational power, the algorithm was able to extract and apply some basic, complex compositional principles. A larger neural system, complemented by additional urban contexts, including GIS, Smart City data, or functional background, could be applied to real-life urban scenarios and provide assistance in the design process. As demonstrated by the higher measured performance in the third scenario, simple compositional queries could be managed even by the current generation of neural network models.
In architecture, the network could be trained on a set of desirable compositional solutions present in the chosen architectural projects and encoded within the floorplans, sections, or BIM models. As with the urban example, the relations between the compositional elements decoded by the algorithm could help define the formal framework of the architectural design, consisting of the proposed arrangement of the concerned elements within the designed building model. This framework could then be used as a guideline, delineating the hidden, compositional principles encoded within the references provided by the designer. This solution would be conceptually similar to neural style transfer, 13 but would offer a more controlled, atomistic assistance limited only to spatial composition.
Due to the flexibility of neural networks with regard to the dimensionality and format of the input data, a similar system would be applicable to 2D and 3D visual arts. A neural algorithm could be used to delineate the preliminary composition of the artwork based on the selected reference material.
Summary, conclusions
This article demonstrates the capacity of deep convolutional neural networks to extract complex compositional principles represented in training image sets and to utilize these principles in the practical generation of new spatial compositions and fragments thereof. With a correct model and application framework, convolutional neural networks can successfully be used in the processing of compositional problems. The performance of the algorithm can be measured both quantitatively (through statistical analysis) and qualitatively (through separate parametric evaluation algorithms or through manual analysis).
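As an illustration of how a parametric check and a statistical summary can work together, the sketch below tests one basic ordering principle (axial symmetry about the vertical centre axis) on a binary image and then summarizes a table of pass/fail results. The helper names, the choice of principle, and the toy pass/fail table are assumptions made for the example; they do not reproduce the evaluation algorithms or the results of the study.

```python
import numpy as np


def is_axially_symmetric(image, tolerance=0.0):
    """Return True if a binary image mirrors across its vertical centre axis.

    `tolerance` is the allowed fraction of mismatching pixels (hypothetical
    parameter; 0.0 demands an exact mirror image).
    """
    mismatch = np.mean(image != np.fliplr(image))
    return bool(mismatch <= tolerance)


def pass_rates(checks):
    """Summarize a boolean table of shape (n_images, n_principles).

    Returns the share of images passing each principle, and the share of
    images that satisfy every principle at once.
    """
    checks = np.asarray(checks, dtype=bool)
    return checks.mean(axis=0), float(checks.all(axis=1).mean())


symmetric = np.array([[1, 0, 0, 1],
                      [0, 1, 1, 0]])
asymmetric = np.array([[1, 0, 0, 0],
                       [0, 1, 1, 0]])
print(is_axially_symmetric(symmetric), is_axially_symmetric(asymmetric))

# Toy table: 4 generated images checked against 2 principles (made-up values).
checks = [[True, True],
          [True, False],
          [False, True],
          [True, True]]
per_principle, all_met = pass_rates(checks)
print(per_principle, all_met)
```

The distinction between the two summary values matters when reading results: an image counts toward the second figure only if it meets all the required principles simultaneously, which is a stricter criterion than any single per-principle rate.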
The chosen methodology imposes certain limitations on the direct applicability of the developed system. The method was tested and evaluated under three comparable scenarios that vary in scope and difficulty. In line with expectations, the performance of the algorithm decreased as the complexity of the tested compositional principles increased. A similar reduction in performance was noticeable as task difficulty increased between scenarios. The efficiency of the studied models in solving difficult and complex compositional problems was not satisfactory: only approximately every fourth image generated in the first and second scenarios met all the required principles. Considering the complexity of real-world design processes, the developed tool was not found sufficiently mature to substitute for the insight of an architect. Nonetheless, even though the performance was not high enough for standalone application of the tool in architectural practice, the method could still be used to complement the design process and to aid the required expert analysis, provided the outcomes proposed by the network are re-evaluated by the operator. Furthermore, the pace of development in machine learning and synthetic image processing suggests that there is room for further improvement or optimization of the existing algorithms. For example, significant advancements have recently been made in the field of vision transformers (ViTs) 29 which can learn in a self-supervised way and are already starting to surpass CNNs in image processing tasks. 30
Some flaws of the presented method have been identified in application and testing. The training datasets of 7000 images were relatively small compared to those used for state-of-the-art machine learning models, but were sufficient to demonstrate the viability of the proposed framework. The significant differences between the performance on the final test set and on the test set that shared the training distribution suggest that the distribution differences introduced between the sets, intended to eliminate the impact of overfitting on the measured performance, were perhaps too severe. The networks could have interpreted these distribution differences as additional compositional principles, which would further lower the performance on the final test set.
However, the performance measured for the simpler, yet still complex, compositional tasks (third scenario) reached 95.1% to 98.5%, which was close to the baseline of an expert human analysis. More complex and well-optimized machine learning algorithms could be expected to surpass the expert level even at difficult compositional tasks. In line with the current trend, various neural network models and other machine learning algorithms can be expected to be applied increasingly often in general design practice. 2 The scope of possible applications of techniques similar to the one presented in this article seems wide, as such methods could be implemented in architecture, urban design, or visual arts.
Footnotes
Declaration of conflicting interests
The author(s) declared the following potential conflicts of interest with respect to the research, authorship, and/or publication of this article: The author(s) report a grant from Warsaw University of Technology, Faculty of Architecture, during the conduct of the study.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Warsaw University of Technology, Faculty of Architecture (504/04584/1010/44.000000).
