A coherent shared address space provides an attractive programming environment for parallel computing. If such an address space is to be provided on a distributed-memory system without dedicated communication hardware, a software remote-data caching mechanism (software Distributed Shared Memory: software DSM) must be used. Optimization is indispensable for making software DSM perform well: compiler optimization, protocol optimization, run-time system optimization, and interfaces that enable these optimizations are all required. As such interfaces, we have introduced two compiler-assisted software DSM schemes. One is a page-based scheme (Asymmetric Distributed Shared Memory: ADSM) that uses the TLB/MMU mechanism only to detect cache misses. The other is a segment-based scheme (User-level Distributed Shared Memory: UDSM) that performs software caching entirely with user-level checking code and consistency-management code. We have also proposed a compiler optimization framework that directly analyzes and optimizes explicitly parallel shared-memory source programs. The optimizing compiler exploits medium-grained and coarse-grained shared-memory accesses to reduce both the volume of communication and the overhead of the user-level checking code. It performs interprocedural points-to analysis and interprocedural shared-access set calculation, using interval analysis to solve redundancy-elimination equations under the lazy release consistency model. We have implemented this optimizing compiler, the Remote Communication Optimizer (RCOP). We have also developed cache-coherence protocols that follow the lazy release consistency model; our experiments show that the SAURC (Software emulation of Automatic Update Release Consistency) protocol performs best among them.
We have developed a lightweight run-time system for cache coherence based on SAURC, and have implemented run-time systems for both ADSM and UDSM on an SS20 workstation cluster. Both schemes achieve high speed-up ratios on the SPLASH-2 benchmark suite. The experimental results show that the combination of the optimizing compiler and software DSM is very effective. They also show that the performance of ADSM is limited by the communication of unnecessary data, while that of UDSM is limited by the instrumentation overhead. These results indicate that executing parallel shared-memory programs, with automatic optimization of remote communication, under a general-purpose operating system on a stock network of workstations is feasible.