---
title: "Brain machine interfaces: Neuron Processor Interface"
date: 2016-08-08T15:26:46+01:00
draft: false
toc: true
math: true
type: posts
tags:
- chapter
- thesis
- CMOS
- biomedical
---
Lieuwe B. Leene, Yan Liu, Timothy G. Constandinou
Department of Electrical and Electronic Engineering, Imperial College London, SW7 2BT, UK
Centre for Bio-Inspired Technology, Institute of Biomedical Engineering, Imperial College London, SW7 2AZ, UK
A core aspect of emerging neuroscience is performing real-time data analysis at a massive scale. However, when we observe its manifestation in state-of-the-art neural interfaces we find the hardware is limited to specific methods that can be short-sighted. This chapter directs our attention to a different point of view on how these sensor systems can be structured. In particular we are guided by the concept of an implant capable of performing software-defined instrumentation. The focus lies with enabling real-time, in-vivo testing of a more diverse set of signal characterization methods. More importantly we will demonstrate that this can be made feasible for large-scale distributed systems.
This particular approach is motivated by a number of factors that aim to increase performance and enable research opportunities. The first is that many aspects of an implant's signal quality cannot be predicted beforehand. As a result, implementing a specific algorithm for specific signal characteristics may lead to failure or an overly conservative design because the environment can potentially be excessively noisy. By introducing the capacity to dynamically execute different processing methods on neural data, the implanted system can use either LFP or EAP activity in real-time. This may be a significant element in improving the success of chronic BMI implants. Moreover, the prolific development in characterization methods used for decoding neural data inhibits a general consensus on DSP techniques. This prevents a single method and corresponding architecture from being applicable in most scenarios. The second factor is that this approach conceptually enables the development of real-time, resource-constrained algorithms, virtues often neglected when working with data sets. Currently most BMI development platforms have limited capabilities to allow algorithms to use external or multi-modal features to inform local operation and simultaneously provide recordings from hundreds of electrodes. This construction may be a key factor in allowing high level algorithms to directly manipulate machine learning parameters local to each implant. This hierarchical fashion should improve the efficiency of distributed BMIs for decoding information. In contrast we question the feasibility of scaling the current supervised methods that require fine measurements of each electrode's recording to approach an optimal decoding strategy. Typically the computational efficiency of this approach remains exhaustive when reconfiguring sensor parameters because it uses a centralized unit that recalibrates all recording channels in an elaborate fashion.
This chapter is organized as follows: Section 31 motivates localized processing for increased efficiency and estimates to what extent we can perform on-chip processing. This is followed by Section 32, where typical methods used for neural signal analysis are introduced, and their hardware complexity is demonstrated in Section 33. This leads into the proposed distributed processing architecture in Section 38, where the design is discussed with respect to the implementation. Section 41 demonstrates the realization of this platform. Finally Section 42 draws conclusions with respect to the digital approach to neural instrumentation.
# 31 Processing at the interface
Ideally a neural interface device is tasked with recording from a large ensemble of electrodes and transmitting information with the lowest bandwidth because harvested power is a scarce resource for implants inside the body. However, over the course of an implant's lifetime most signal characteristics change dynamically, which implies that there should be an involved learning process that similarly adapts to these changes. This can also mean that the output bandwidth is constrained by the total amount of mutual information that can be retained within the device. Such a device will predict the expected recording from one time interval to the next and differentiate any new information that needs to be transmitted. Hence we should be convinced that the processing capacity or complexity of a closed or memory-limited system should be reflected in its fundamental ability to store information [^140].
In order to capture some high level trends with respect to processing requirements, let us normalise memory capacity in terms of state variables independent of modality. This is particularly useful because the number of state variables in a dynamic process is a good indicator of complexity, whether it is a digital classifier or an analogue filter. Here we will exclusively focus on processing by assuming the signal being operated on is idealized with respect to amplitude and its representation. This extends from our analysis in Section \ref{ch:T1_model} by elaborating specifically on comparing the digital and analogue resource allocation associated with processing.
$$ R_{A} = \underbrace{\frac{2\pi BW kT SNR^2 U_T}{V_{DD} L}}_{power} \cdot \underbrace{\left( \frac{A_{min}}{L} +\frac{kT SNR^2}{L V_{DD}^2 C_{dens}} \right)}_{Area} $$
If we represent the resource required as the power-area product for a state variable, then in the analogue domain it is given by Equation 32. Here \\(BW\\), \\(V_{DD}\\), \\(C_{dens}\\), \\(A_{min}\\) reflect the signal bandwidth, supply voltage, capacitor density, and typical transconductance area overhead for a particular technology respectively. \\(L\\) is a normalized feature size that allows us to evaluate parameters for a particular technology and extrapolate them based on constant field scaling factors.
$$ R_{D} = \underbrace{ 2BW \alpha \log_2(SNR) C_{gate} V_{DD}^2 L^2 }_{power} \cdot\underbrace{ \alpha \log_2(SNR) A_{gate} L^2}_{Area} $$
Similarly, Equation 33 represents the power-area product for a digital state variable. \\(C_{gate}\\), \\(A_{gate}\\), and \\(\alpha\\) parametrise the typical gate capacitance, area, and overhead for each register respectively. Generally the dependency of both \\(R_A\\) and \\(R_D\\) on these parameters is well understood and guides maximizing system efficiency in an abstract sense [^141].
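To make the comparison concrete, the sketch below evaluates Equations 32 and 33 directly. All technology constants (\\(C_{dens}\\), \\(A_{min}\\), \\(C_{gate}\\), \\(A_{gate}\\), \\(\alpha\\)) are illustrative guesses rather than the values used for Figure 49, so only the functional dependence on bandwidth, SNR and feature size should be read from the printed numbers.

```python
import numpy as np

kT, U_T = 1.38e-23 * 310, 26e-3                 # thermal energy at ~37 C, thermal voltage
BW, SNR, VDD, L = 10e3, 10**(50/20), 1.8, 1.0   # 10 kHz band, 50 dB SNR, L normalized to 180 nm
C_dens, A_min = 5e-3, 1e-9                      # assumed capacitor density (F/m^2) and gm area (m^2)
C_gate, A_gate, alpha = 2e-15, 1e-12, 2.0       # assumed per-gate capacitance, area, register overhead

# Equation 32: analogue power-area product per state variable
P_A = 2 * np.pi * BW * kT * SNR**2 * U_T / (VDD * L)
A_A = A_min / L + kT * SNR**2 / (L * VDD**2 * C_dens)
R_A = P_A * A_A

# Equation 33: digital power-area product per state variable
bits = alpha * np.log2(SNR)
R_D = (2 * BW * bits * C_gate * VDD**2 * L**2) * (bits * A_gate * L**2)

print(f"R_A = {R_A:.2e} W*m^2, R_D = {R_D:.2e} W*m^2")
```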
{{< figure src="/images/phd-thesis/impact.png" title="Figure 49: Impact of technology on \\(R_A\\) analogue (green) and \\(R_D\\) digital (blue) processing resource requirements extrapolated from a \\(180 nm\\) CMOS technology under constant field scaling." width="500" >}}
For neural instrumentation however both power and area requirements must be highly constrained in order to realize a device that can accommodate a large number of recording channels and remain implantable. Figure 49 shows the resource requirements for processing in the analogue and digital domain with respect to signal fidelity and CMOS technology. Either approach can present an advantage over the other under specific conditions. Digital systems appear favourable beyond \\(65 nm\\) CMOS, whereas analogue does better at lower SNR in technologies with a larger feature size. The discussion in Section \ref{ch:T1_model} suggested that the analogue preconditioning requires a resource allocation of $10^{-15} Wm^2$ with a weak dependence on technology. Moreover, if quantization is not considered then power is entirely determined by the noise specification and the area requirement depends on the gain configuration. Comparing this figure with the estimate on \\(R_D\\) indicates that we should be able to integrate a considerable amount of processing capability before the DSP uses a comparable amount of resources. This is important because improving on-chip processing capacity ideally results in requiring less supervision and a lower wireless communication bandwidth.
While we may expect other sources of overhead and extra power dissipation, we should take a moment to consider the implications of this result, particularly when considering the claim that electronic sensing of brain activity at larger scales is not viable due to the excessive data rates derived from the principal entropy relations and the associated communication bandwidth [^142]. Clearly any degree of on-chip processing undermines this limitation because it enables us to achieve data rates far lower than that of the electrical signals by finding a more appropriate basis for encoding information. On the other hand, it does raise an important point regarding the relationship between the generated output data rate and the recorded signal-to-noise ratio. In some sense we are simply faced with the challenge of best consolidating the recorded information towards high-level indicators for specific objective functions. This allows us to approach the data rates extracted from current BMI studies, which are negligible in comparison to the Nyquist rates. In this light we argue that electronic methods for recording activity are the closest to realizing a viable neuroprosthetic solution in the near future when compared with optical, magnetic and less invasive BMI architectures.
Now we can assert that there are two approaches to solving the system-level challenge of integrating wireless neural instrumentation systems. The first is a mixed-signal topology that extensively uses analogue processing such that technology has a weak impact on improving efficiency. Instead the critical component lies with the effectiveness of analogue dimensionality reduction. In such a case we need to adopt a well established algorithm that can accommodate analogue variability and still deliver exceptional signal characterization. The second approach is to rely on digital methods that deliver robust and reconfigurable compute resources that scale well with technology. This should guarantee the capacity for a variety of fully adaptive algorithms capable of extracting multi-modal characteristics from recordings, which could be much more valuable for experimental neuroscience at this point in time. Unfortunately not all forms of algorithms can use the low power characteristics of processing in the analogue domain. Moreover they are typically limited by underlying assumptions regarding noisy perturbations. When we introduce different contexts of operation, reconfiguring the analogue cannot be done as arbitrarily as it would be in a digital structure. For this reason we will adopt the digital approach in order to leverage robust reconfigurable capabilities, and reconsider the analogue approach in Section 48.
The significance here is that these trends allow us to roughly estimate the complexity of algorithms for different technologies if their resource requirements are made equivalent to those of the instrumentation circuits. We show in Figure 50 that a $0.18 \mu m$ CMOS process should allow approximately 100 state variables, or equivalently about 100 operations per sample taken. In fact, looking at image processors that similarly rely extensively on data-intensive post processing, we can see an identical dependency on technology scaling as we have predicted for various levels of digital performance at different technology nodes. It is important to note that the normalized efficiency evaluated here is in fact independent of signal bandwidth and only depends on the signal-to-noise ratio and its relation to the supply voltage.
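As a brief restatement of the reasoning behind Figure 50: the available digital complexity is simply the ratio of the instrumentation budget to the per-state digital cost, with the \\(\approx 100\\) figure at \\(0.18 \mu m\\) following from the \\(10^{-15} Wm^2\\) front-end allocation quoted above.

$$ N_{SV} \approx \frac{R_{inst}}{R_{D}}, \qquad R_{inst} \approx 10^{-15}\,Wm^2 $$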
{{< figure src="/images/phd-thesis/Operations.svg" title="Figure 50: Analytic number of digital operations available with respect to different technologies (red) with references to the normalized performance of image processors (blue)." width="500" >}}
$$ P_{system} = \underbrace{ N_{channel} \cdot \left( P_{Algo} + P_{Transmit} \right) }_{In \: Channel} + \underbrace{ P_{Control} + P_{Comms} }_{System \: Level} $$
System-level design of an embedded processing system for BMIs should be guided entirely by the optimization of compute power efficiency. As shown by Equation 34, we expect two primary components in the system power breakdown. The overall objective should lie with minimizing the channel-level power dissipation of the algorithm \\(P_{Algo}\\) by increasing that of the system-level control \\(P_{Control}\\), such that the component that scales with channel count is reduced. Secondly we should keep in mind that reducing the processed output data alleviates the dissipation of the on-chip communication \\(P_{Transmit}\\) and the external telemetry power consumption \\(P_{Comms}\\).
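As a toy illustration of this budget, the sketch below evaluates Equation 34 for a hypothetical device; all power figures are placeholders chosen only to show how the per-channel term dominates as the channel count grows, not measured values.

```python
def system_power(n_channels, p_algo, p_transmit, p_control, p_comms):
    """Equation 34: per-channel processing/transmit terms plus fixed system overhead (all in W)."""
    return n_channels * (p_algo + p_transmit) + p_control + p_comms

# Placeholder numbers to show the scaling of the in-channel term with channel count.
for n in (16, 64, 256):
    print(n, system_power(n, p_algo=2e-6, p_transmit=1e-6, p_control=50e-6, p_comms=100e-6))
```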
## 32 Methods for Neural Signals
A key component of developing this platform is a discussion of the diverse set of signal processing methods performed on neural data and their computational requirements, in order to determine our system's specifications. More importantly we want to judge the expected complexity of some of these operators and how many variables are allocated during each process. There are in principle four different categories of methods applied to process neural signals, which are listed below. In practice, a single integrated system will utilize a multitude of different techniques to achieve denoising and feature extraction.
**Pre-Processing** is the filtering and conditioning of each ADC output sample using FIR, IIR, or non-linear filters. Here the objective is simply to de-noise the features or signal components. The resulting signal allows for better detection or more precise evaluation of the signal characteristics and is often closely related to the characteristics of the instrumentation circuits. This output may be considered as the raw signal recording that is used to benchmark any post processing methods or other instrumentation systems.
**Detection** is associated particularly with capturing intermittent spike events. No quantitative evaluation is made with regard to the nature of the detected spike, but any detection may trigger the process that records contiguous samples around the detection event, which are then subsequently characterized. These events are commonly triggered upon simple threshold crossings of the signal or its integrated power over several samples. In some systems the interest lies only with the accurate detection of spike events, which is sufficient to perform closed-loop therapeutic treatment or control actuators external to the body.
**Data reduction** can be interpreted either as representing a spike waveform in terms of defining features or with a reduced basis to allow approximate reconstruction by using spike amplitude or wavelet decomposition. Representing waveforms in terms of their quintessential components allows for more efficient post processing and reduces data rates in the case of wireless telecommunication. In its primitive sense this is simply dimensionality reduction of the recorded data and is usually followed by supervised or unsupervised signal classification.
**Classification** is the predominant objective for BMIs and is the main difficulty to realize inside an embedded recording system. Such a task in spike based systems primarily performs a generalization of the detected spike shape in terms of the previously detected neurons. This reflects the fact that in most cases multiple neurons can be detected by a single electrode, and by distinguishing these events the integrity of information can be preserved. The objective here lies with having a spike event output equivalent to simple detection to perform actuation, but with better fidelity.
It should also be clear that the above operations primarily focus on reducing the recorded signal to its primitive components in terms of spike events at a rate of \\(100 b/s\\), instead of the \\(256 Kb/s\\) data stream typically generated by the ADC. The processing layer on top of this elementary function will either aim to evaluate neural connectivity or use collections of spike rates to perform inference of high-level dynamics. This application-specific processing will not be considered here, primarily because the nature of such a problem is very different from the more generic information extraction from recordings; that system architecture should revolve specifically around multichannel trained dimensionality reduction. Even when such a task can adjoin what is presented here, it is outside the scope of this discussion, which targets more generic signal instrumentation.
{{< figure src="/images/phd-thesis/Survay_I.svg" title="Figure 51: Estimated resource requirements for different classes of algorithms use for processing neural recordings found in literature." width="500" >}}
When we survey the various algorithms found in recent literature and estimate the expected memory/computational requirements, we might observe a distribution like that shown in Figure 51. For fair comparison we have adapted these methods to operate on a window size of 32 samples with three types of detected spike waveforms when applicable, and only accounted for memory allocation that cannot be shared across channels. This should give a good normalized indication of the limiting components for each method, which we could then further optimize in a more specialized manner. Notice that there is a strong correlation between the memory usage and the required number of operations for most embedded systems.
Some of the most efficient spike detection and feature extraction is associated with using temporal characteristics of the spiking waveforms. Common examples include using the time interval between minimum and maximum peaks or the duration of threshold crossing for detected spikes. The defining characteristic here is that alignment buffers are not needed, which leads to using very few operations per sample. Inherently the drawback is the increased sensitivity to noise, which implies a very limited capacity to distinguish and classify different spike shapes unless more filtering is performed. There are other methods that also operate with reduced signal buffers, for instance compressed sensing, where signals are continuously re-projected with a sensing matrix. This requires a few accumulators for each coefficient being extracted but is strictly data compression. Ultimately this may not help directly with classification or signal characterization, which must now be performed off-chip before any benefit can be realized.
Many other classes of algorithms operate on a windowed basis and exploit learned mean spike shapes that are expected in the recording. Here a convolution or distance operator will indicate which class of spike is detected. For terminology, let each convolution of the signal result in a feature that is used for detection and/or classification. The adaptive component of these methods leverages a significant amount of noise shaping and separation depending on what the objective function entails when the basis for convolution is being determined. The most prevalent approach is convolving with principal components, which simply maximizes the signal variance in the projected space. In contrast to temporal features that struggle with sample-limited denoising, the windowed operators should be more robust towards noise. Instead there is some difficulty in systematic alignment of the window with the spike. This aspect often motivates increased sampling rates or interpolation in order to perform accurate alignment.
The objective function for determining the convolution kernels can be oriented towards maximising sparsity [^154], signal to noise [^155], cluster separation [^49], or expectation maximization [^156], each reflecting different signal modalities and analysis methods. Although the complexity of training can vary extensively, the local operation for classification after adaptation is almost equivalent. This operation is the \\(F\\) linear projections using \\(W\\) samples onto the feature space, where \\(F\\) and \\(W\\) are the number of features used and the number of samples in the window respectively. There may be some deviation from this operation if the different confidence intervals for each class are taken into account. This can be done by evaluating centroid distances in terms of the variance of each respective centroid. Conversely to such training, generalized templates averaged over a number of recording channels may also be used for feature extraction in order to share the memory requirements, at the loss of not achieving maximally separated clusters for each individual channel.
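A minimal sketch of this shared classification step, assuming an already-learned basis and illustrative dimensions: the window is reduced to \\(F\\) features by linear projection and then assigned to the nearest centroid, optionally weighted by each centroid's variance.

```python
import numpy as np

W, F, K = 32, 2, 3                       # window length, features, spike classes (illustrative)
rng = np.random.default_rng(0)
basis = rng.standard_normal((F, W))      # learned convolution kernels (e.g. principal components)
centroids = rng.standard_normal((K, F))  # cluster centres in feature space
variances = np.ones((K, F))              # per-centroid spread estimates

window = rng.standard_normal(W)          # aligned spike window from the detector
features = basis @ window                # the F linear projections of W samples
scores = np.sum((features - centroids) ** 2 / variances, axis=1)
spike_class = int(np.argmin(scores))     # variance-weighted nearest centroid
```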
Naturally it is challenging to objectively judge feature extraction and classification methods. We could always reduce the dimensionality of the search space or simplify the convergence strategies to reduce memory and computational requirements. For this reason the details in Figure 51 should be considered in relative terms due to the generalizations made with regard to system specifications. As many methods in the literature are not performed at the sensor interface, they typically will not take advantage of processing on a sample-to-sample basis and opt for a batched or sequenced processing methodology.
We claim however that the segregation between methods primarily lies with whether the classification features are based on sample-space characteristics or instead use windowed convolution operators. The former is the less rigorously justified, as the features relating to amplitude and spike width have weak physiological significance, but is aggressively more efficient than other methods. The characteristic requirement of the latter approach is typically related to the window size, which is indirectly associated with the sampling speed needed to fit the relevant spike shape into the window. One important objective of current decoding research is related to introducing adaptive techniques that iteratively improve classification without supervision, particularly without excessive memory requirements for keeping track of long-term statistics. To demonstrate why this can be particularly challenging, consider a simplified example of using K-means to directly cluster the sample space adaptively where we are fortunate to know the number of classes is three.
Clearly for each detected spike we would need to evaluate the distance between our data and the centroids, giving a memory and complexity requirement of \\((1+F) W\\) variables and \\(2 F W\\) operations respectively. Then once the class is determined, an additional \\(2W\\) operations are needed to adjust the centroid with the new data. This may include keeping track of the additional \\(F W\\) truncation residues that allow using a small enough adjustment weight for convergence when our quantization is limited by an 8-bit system. Now for completeness, assume our window is 32 samples: we come to the conclusion that for each recording channel we need to actively allocate nearly \\(2\\) Kbit.
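For completeness, the arithmetic behind that figure, assuming \\(F=3\\) centroids, a 32-sample window and 8-bit words:

```python
W, F, bits = 32, 3, 8
state_vars = (1 + F) * W + F * W       # data window + centroids, plus truncation residues
ops_per_spike = 2 * F * W + 2 * W      # centroid distances + update of the winning centroid
print(state_vars, "state variables =", state_vars * bits, "bits per channel")  # 224 -> 1792 bits (~2 Kbit)
```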
Typically this primitive aspect of a generic classification algorithm is intensive enough to warrant not performing it locally, in contrast to the complexity of the aforementioned temporal features. It also raises the challenge of trained dimensionality reduction without supervision, which exclusively relies on evaluating the covariance matrix in order to minimize the correlation of non-signal components with our new basis. The above algorithm may be representative in terms of complexity, with the exception that basis pursuit relies heavily on inner products. There is some relief from the fact that these optimal bases change very slowly, and active adaptation is only needed once every few hours as the electrode recording changes slowly with respect to the neural activity. As will be demonstrated, the feasibility of these more involved methods relies very much on the careful construction and memory allocation of the algorithm with respect to the processing architecture. Operations like k-means and PCA decomposition can be performed to a certain extent if the operations are broken down into incremental procedures.
# 33 Resource Constrained Classification
This section primarily serves to substantiate our expectations for performing processing at the sensor interface and to evaluate where the system-level requirements should lie. Ultimately such a system needs to encompass a significant variety of different application requirements. We will consider the implementation of two well known methods that process neural recordings and generate classified events. This will be applied to the equivalent scenario where the proposed instrumentation front-end from Section 17 is used, such that the digitized signal has considerable dynamic range but lacks analogue filtering. In particular this implies the recorded signal is filtered by a first-order Butterworth low-pass filter in addition to near-DC rejection while being sampled at \\(25 KS/s\\). We will address the rejection of low frequency aggressors in addition to the computational requirement of typical processing algorithms.
Many of the filter and processing considerations are guided by evaluating accuracy empirically and justified through constrained parametric optimization [^157]. The particular algorithms implemented here are structured in such a way that they make specific considerations for the underlying hardware. On numerous occasions we will employ single-bit accumulators as approximations to the equivalent IIR feedback structures in order to improve our effective register depth through feedback. This is a primary advantage of in-channel processing, where we may exhaustively make use of recorded data without having to be concerned with the communication of these components.
Empirical validation is demonstrated by using synthetic data sets that are publicly available online. This data is based on characterized extracellular recordings where both background activity and spike morphologies are extracted from a human neocortex and basal ganglia. The synthesized recording was originally used to evaluate the performance of super-paramagnetic clustering with wavelet decomposition at different background noise levels in [^1]. Synthetic data specifically allows the inference of the ground truth, resulting in unbiased performance indicators. For fair comparison of analogue and digital techniques we additionally include low frequency content from \\(1-300 Hz\\) at \\(10\times\\) the largest peak-to-peak amplitude of the extracellular action potentials found in the recordings.
## 34 Spike Detection & Filtering
Arguably the most influential aspect of neural detection and classification algorithms is the signal preconditioning for systematic and accurate detection of spike events. The importance lies with the fact that detection behaviour has a significant influence on how the feature space appears when the spike is characterized with various methods. Although amplitude noise can usually be accounted for in terms of filtering, any misalignment in the time domain due to noisy aggressors in the detection operator can up-modulate low frequency components. The tendency to perform detection in the digital domain is entirely related to the instantaneous characteristic of discrete-time processing, which is superior to the group delay inherent to analogue implementations. Minimizing this factor will minimize the additional memory needed for capturing the signal before the detection event.
The method proposed here tracks both the mean spike amplitude and the background noise level in order to set the detection threshold for spike events. Motivated by using physiological characteristics to specify the underlying operation, parameter \\(k_3\\) is introduced to represent the relative amplitude of background activity to that of the maximum spike waveform intended to be detected. In other words, if we are only interested in the closest neurons to the electrode, \\(k_3\\) should be close to 1; otherwise, if we also want to detect background activity with an amplitude at \\(25\%\\) of the largest spiking events, \\(k_3\\) should approach \\(4\\). In actuality this term should also be related to how well our classification can separate noisy detections from actual spike events.
\begin{algorithm}
\DontPrintSemicolon
\KwData{Sample from ADC \\(X[n]\\)}
\KwResult{Detection events & spike window \\(W_1\\)}
\Begin{
\ShowLn
Update \\(V_{LFP}\\) with $k_1 \cdot (X[n]-V_{LFP})$ \tcp*{ Track low frequency content}
Set \\(S[n]\\) with $X[n] - V_{LFP} + k_2 \cdot S[n-1]$ \tcp*{ IIR bandpass filter}
Set \\(G[n]\\) as $\sum S[n]\cdot FIR(2*R) $ \tcp*{ FIR bandpass filter }
Set \\(ES[n]\\) as $S[n] \cdot G[n-R]$ \tcp*{ Energy Estimate from IIR & FIR product }
Update \\(V_{noise}\\) with $k_1 \cdot (|ES[n]|-V_{noise})$ \tcp*{ Estimate variance on energy estimate }
\uIf{$ES[n] > V_{th}$ **and** $ES[n] > max_{local}$ \tcp*{ Find threshold crossing or new local max } }{
Update \\(V_{th}\\) with $k_1 (ES[n-1]+2V_{noise} - k_3 \cdot V_{th})$ \tcp*{ Adapt peaks and variance}
Initiate Spike Alignment \;
Set \\(max_{local}\\) to \\(ES[n]\\) \tcp*{Set local maximum}
Set \\(index\\) to \\(0\\) \tcp*{Initiate data pointer}
}
\uElseIf{Currently aligning spike (\\(index<16\\)) }{
Set \\(W_1[index]\\) to \\(G[n-10]\\) \tcp*{Store spike waveform with delayed samples}
Set \\(index\\) to \\(index+1\\) \tcp*{Increment data pointer}
}
\uElseIf{Idle state (\\(index>31\\))}{
Set \\(max_{local}\\) to \\(0\\) \tcp*{Finish classification & find next local maximum}
}
\uElse{
Accumulate \\(index\\) with \\(1\\) \tcp*{Increment data pointer}
%Perform Classification on \\(W_1[index]\\)\;
}
}
\BlankLine
\caption{Spike Detection and Alignment}
\label{aglo:T2_Detection}
\end{algorithm}
The specifics of this operation are reflected in Alg. \ref{aglo:T2_Detection}. Here the terms Update, Set, and Accumulate represent recurrence, instantaneous, and integrated relations respectively. The state variable \\(V_{LFP}\\) primarily removes low frequency drift that is not associated with individual spiking events, and \\(S[n]\\) is as a result a bandpass equivalent of our sampled signal \\(X[n]\\). The signal's instantaneous energy is represented by \\(ES[n]\\), which is the product of \\(S[n]\\) and the delayed derivative computed by the FIR of even order \\(2R\\) with the coefficients $a_n= -a_{2R-n} = 1-2/R \cdot(n-1)$ for \\(n\\) from \\(1\\) to \\(R\\). The factor \\(R\\) is associated with the ratio of sampling interval to spike polarization interval, equivalently $R=f_{nyquist} / 5KHz$. At the maximum of \\(ES[n]\\), the operation on line 5 essentially measures the product of the maximum spike intensity with the maximum derivative that precedes it by \\(R\\) samples. This method primarily depends on the fact that spike detection looks for highly correlated narrow-band energy, which rejects a substantial amount of white noise. Moreover, the operator compresses uncorrelated components in amplitude as it exhibits a square dependency, $ES[n] \propto S[n]^2$, making detection less sensitive to variation in the threshold. The fact that the operator is narrow-band limits the detection of slower spike waveforms that do not contain large derivative components, but on the other hand this guarantees more systematic alignment. In this case alignment is done simply with respect to where the peak value of \\(ES[n]\\) is detected.
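A behavioural sketch of these filtering and threshold recurrences is given below. The FIR construction follows one plausible reading of the coefficient definition, the constants are illustrative rather than the fabricated configuration, and the peak-search and window-capture logic of the full algorithm is omitted.

```python
import numpy as np

def detect(x, k1=1/32, k2=0.5, k3=3.0, R=3):
    """Behavioural model of the detection recurrences; constants are illustrative only."""
    a = np.array([1.0 - 2.0 * (n - 1) / R for n in range(1, R + 1)])
    fir = np.concatenate([a, -a[::-1]])        # antisymmetric FIR of even order 2R
    s_hist = np.zeros(len(fir))                # recent S[n] samples feeding the FIR
    g_hist = np.zeros(R + 1)                   # delay line providing G[n-R]
    v_lfp = v_noise = v_th = s_prev = 0.0
    events = []
    for n, xn in enumerate(x):
        v_lfp += k1 * (xn - v_lfp)             # track low-frequency content
        s = xn - v_lfp + k2 * s_prev           # IIR bandpass
        s_prev = s
        s_hist = np.roll(s_hist, 1); s_hist[0] = s
        g = float(fir @ s_hist)                # FIR bandpass / delayed derivative
        g_hist = np.roll(g_hist, 1); g_hist[0] = g
        es = s * g_hist[-1]                    # energy estimate S[n] * G[n-R]
        v_noise += k1 * (abs(es) - v_noise)    # spread of the energy estimate
        if es > v_th:                          # threshold crossing -> spike event
            v_th += k1 * (es + 2 * v_noise - k3 * v_th)
            events.append(n)                   # alignment / window capture omitted here
    return events
```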
{{< figure src="/images/phd-thesis/freq_pfd.svg" width="500" >}}
{{< figure src="/images/phd-thesis/phase_pfd.svg" title="Figure 52: Extracted frequency characteristics of digital filter used in Algorithm \ref{aglo:T2_Detection}." width="500" >}}
The overall filtering characteristic of \\(S[n]\\) and \\(G[n]\\) is shown in Figure 52. The IIR bandpass is a result of \\(k_2\\) being \\(0.5\\) such that both filters suppress components around the Nyquist frequency. The group delay should be equivalent to a single high-pass pole at $250 Hz$, but the FIR assists in further suppressing high and low frequency components. We should not expect significant contribution from group-delay-induced distortion as the features of interest will predominantly have $1 KHz - 5 KHz$ components. Besides, \\(R\\) and \\(k_1\\) can always be adjusted to reposition the high-pass poles closer to DC. Note that \\(V_{LFP}\\) will represent the \\(DC-250 Hz\\) signal components that can be used to infer characteristics about the background activity.
{{< figure src="/images/phd-thesis/C05.svg" width="500" >}}
{{< figure src="/images/phd-thesis/C01.svg" width="500" >}}
{{< figure src="/images/phd-thesis/C02.svg" title="Figure 53: False alarm rates normalized by true positives for data sets with different background activity." width="500" >}}
The overall performance shown in Figure 53 reflects how spike detection is systematically accurate until the noise level approaches \\(50\%\\) of the signal intensity, irrespective of the data set, for the default case where \\(k_3=3\\). Note that the white noise is additive to the background activity, implying \\(-14dB\\) of white noise and \\(-14dB\\) of background activity should evaluate to around \\(-8dB\\) accumulated SNR. When the noise level exceeds the background activity anticipated by \\(k_3\\) we observe a strong increase in the number of detected false positives. The rate of false negatives presents a more gradual increase, but at this point classification is much more challenging. As expected, background activity has a considerably bigger impact on the false alarm rate because its spectral content and signal structure are equivalent to those of the foreground activity.
We can observe that the main component of computational complexity in this detection operator arises from the FIR & IIR high-pass filters, where the order is closely related to the sampling frequency. In fact, if we ignore the buffer used to capture features before the alignment event, this filter accounts for 70% of the memory utilization and 53% of the elementary operations, while the rest is used for evaluating the instantaneous energy and performing overhead control. Note that the classification operator should be introduced in line 19 on a sample basis using the index as a reference pointer. This implies that we will be classifying while repolarization occurs at the electrode and our detection trigger is blanked out during this interval, so we lose the capacity to detect any overlapping spikes. Such events have limited occurrence, and missing them can be acceptable because proper classification would likely fail as well.
## 35 Recursive Variance Decomposition
Another commonly used technique is that of principal component analysis (PCA), which extracts the largest loading vectors \\(\nu_n\\) of the covariance matrix. This predominantly negates the systematic components of the captured signal and reduces the dimensionality of the spike window to a subset of maximally varying features by linear transformation. These components are particularly useful as an indication of spiking activity in the signal due to the structure in \\(\nu_n\\), but typically also suffice for providing a basis for classification in low noise conditions and reducing complexity once these vectors are found. The challenge specifically lies with the fact that determining this basis requires both the computation of the covariance matrix that evolves over time as well as finding the transformation that diagonalizes that matrix. The implication is that in order to extract the first two principal components we need to track a total of \\(W(W+3)\\) state variables, where \\(W\\) is the number of samples in the spike window.
The iterative method employed here, referred to as recursive variance decomposition (RVD), is an approximation to standard PCA that recursively tracks the largest two loading vectors, reducing the minimum number of state variables to $3W + 3$. Similarly to PCA estimators like Hebbian eigenfilters [^155], every iteration incrementally updates the learned basis without requiring extensive computation. The methodology is based on recursive extraction of the largest loading vector $|\nu_1|$, normalized by \\(g_1\\), by checking the condition $(x - x \cdot \nu)\cdot \nu = 0 $. This condition checks if there is any residue in the direction of \\(\nu_1\\) after removing its component, to see if it is appropriately scaled. Moreover, due to the strong correlation between the mean and first principal component, we approximate $sign(\mu) \approx sign(\nu_1)$, completing the extraction of \\(\nu_1\\). In fact these two statements allow a significant reduction in complexity as normalization is achieved through feedback. The noise shaping and orthogonality properties associated with PCA are preserved using this extraction, which is the most important aspect.
\begin{algorithm}
\DontPrintSemicolon
\KwData{Spike window \\(W_1\\)}
\KwResult{First two aggregate loading vectors \\(\nu_1\\) & \\(\nu_2\\)}
\Begin{
\ForEach{Sample **n** in window \tcp*{ Projection phase } }{
$D_1[n] = W_1[n] - \mu[n]$ \tcp*{ Get distance from mean spike }
Accumulate \\(p_1\\) with $D_1[n] \cdot \nu_1 \cdot sign(\mu[n])$ \tcp*{ Project spike with \\(\nu_1\\) }
Accumulate \\(p_2\\) with $D_1[n] \cdot \nu_2 $ \tcp*{ Project spike with \\(\nu_2\\) }
}
\;
\ForEach{Sample **n** in window \tcp*{ Training phase } }{
Update \\(\mu[n]\\) with $ k_1 \cdot sign(W_1[n] - \mu[n])$ \tcp*{ Track mean spike }
Accumulate \\(\nu_{1}[n]\\) with $k_1 \cdot sign(| D_1[n]\cdot g_1 | - \nu_{1}[n])$ \tcp*{ Move \\(\nu_1\\) towards \\(D_1[n]\\) }
Accumulate \\(\nu_{2}[n]\\) with $k_1 \cdot sign( (D_1[n] - \nu_{1}[n] \cdot p_1)\cdot g_2 - \nu_{2}[n])$ \; \tcp*{ Move \\(\nu_2\\) towards \\(D_1[n]-p_1\cdot\nu_1\\) }
Accumulate \\(p_3\\) with $(|D_1[n]| - \nu_{1}[n]\cdot p_1) \cdot \nu_{1}[n]$ \tcp*{ Get gain error }
Accumulate \\(p_4\\) with $(|D_1[n]| - \nu_{2}[n]\cdot p_2) \cdot \nu_{2}[n]$ \tcp*{ Get gain error }
}
Accumulate \\(g_1\\) with $k_1 \cdot sign(p_3)$ \tcp*{ Adjust gain on \\(\nu_1\\) }
Accumulate \\(g_2\\) with $k_1 \cdot sign(p_4)$ \tcp*{ Adjust gain on \\(\nu_2\\) }
}
\BlankLine
\caption{Recursive variance decomposition}
\label{aglo:T2_PC_l1min}
\end{algorithm}
Algorithm \ref{aglo:T2_PC_l1min} shows the operation for estimating the first two PCA components. Here \\(D_1[n]\\) is the new data point offset by the mean spike waveform $\mu [n]$, which allows the long-term estimation of aggregate variance. Similarly, parameter \\(k_1\\) specifies how the state variables are exponentially averaged over the preceding data points. Because the projection of the first loading vector must be evaluated before the second vector, these operations must be sequenced in time or with memory buffers. The evaluation of \\(p_4\\) is strictly for illustrating the iterative method by which other components are evaluated, while \\(g_2\\) can also be adjusted to normalize the values of \\(p_2\\) to prevent overflow without needing \\(p_4\\).
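A floating-point sketch of one RVD iteration is given below; it mirrors the sign-based single-bit accumulator updates of Algorithm \ref{aglo:T2_PC_l1min} but makes no attempt to model the fixed-point truncation residues, and the state initialization shown is illustrative.

```python
import numpy as np

def rvd_iteration(w1, st, k1=1/64):
    """One projection + training pass over a spike window `w1` (1-D array)."""
    d1 = w1 - st["mu"]                                         # distance from mean spike
    p1 = float(np.sum(d1 * st["nu1"] * np.sign(st["mu"])))     # projection onto nu1
    p2 = float(np.sum(d1 * st["nu2"]))                         # projection onto nu2
    st["mu"]  += k1 * np.sign(w1 - st["mu"])                   # track mean spike
    st["nu1"] += k1 * np.sign(np.abs(d1 * st["g1"]) - st["nu1"])
    st["nu2"] += k1 * np.sign((d1 - st["nu1"] * p1) * st["g2"] - st["nu2"])
    p3 = float(np.sum((np.abs(d1) - st["nu1"] * p1) * st["nu1"]))   # gain error on nu1
    p4 = float(np.sum((np.abs(d1) - st["nu2"] * p2) * st["nu2"]))   # gain error on nu2
    st["g1"] += k1 * np.sign(p3)                               # normalization through feedback
    st["g2"] += k1 * np.sign(p4)
    return p1, p2                                              # the two classification features

W = 32
state = {"mu": np.zeros(W), "nu1": np.full(W, 0.1), "nu2": np.full(W, 0.1), "g1": 1.0, "g2": 1.0}
```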
## 36 Template Matching using K-means
Finally we consider the implementation of template matching in channel. This can be seen as simply a K-means clustering method without dimensionality reduction on the input vector. The implication is that it is characteristically more memory intensive but requires less computationally intensive operators.
\begin{algorithm}
\DontPrintSemicolon
\KwData{Spike window \\(W_1\\)}
\KwResult{Classification with respect to aggregate clusters}
\Begin{
Accumulate $Spike \: Count$ with \\(1\\) \tcp*{track accumulated statistics}
\ForEach{Sample **n** in window \tcp*{Projection Phase} } {
\ForEach{Template **k** in memory} {
Accumulate \\(p_k\\) with $W_1[n] - T_k[n] $ \tcp*{Get \\(l_1\\) distance for each spike class}
}
}
Find $p_{min}=min{[|p_1|, \: |p_2|, \: |p_3|, \: |p_4|]}$ and Set \\(c\\) to index \tcp*{Find most similar}
\ForEach{Sample **n** in window \tcp*{Training phase} }{
Accumulate \\(T_c[n]\\) with $k_1 \cdot sign(W_1[n] - T_c[n])$ \tcp*{ Adjust most similar class}
\If{ Not all templates generated **and** $Spike \: Count > k_2$ } {
Duplicate existing templates \;
Set $Spike \: Count$ to 0 \;
}
}
}
\BlankLine
\caption{Incremental K-Means classification}
\label{algo:T2_Kmean}
\end{algorithm}
The implementation considered in Algorithm \ref{algo:T2_Kmean} is relatively straightforward: one section evaluates the generation of new templates and the other adjusts existing templates with new data. The template approach in general has good noise performance due to the redundancy in correlated features that average out white noise. There is usually some concern with respect to the convergence of k-means centroids, typically because noisy sample points may be initialized as new clusters, thereby wasting memory. The method used here iteratively duplicates centroids after convergence, which minimizes the impact of noisy data in the feature space. As illustrated in Figure 54, during each iteration the centroids converge to mean positions. Due to the morphology these centroids may take, we generally need more centroids than there are clusters, but this approach works well when there are few spike classes. The assumption here is that we are clustering features that are characteristically Gaussian mixtures.
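The sketch below captures the projection and training passes of Algorithm \ref{algo:T2_Kmean} in floating point; the signed \\(l_1\\) accumulation, the sign-based update of the winning template and the duplication trigger follow the pseudocode, while the thresholds and initialization are illustrative.

```python
import numpy as np

def classify_spike(w1, templates, stats, k1=1/16, k2=256, max_templates=4):
    """Classify window `w1` against stored templates and nudge the winner."""
    stats["spike_count"] += 1
    p = [float(np.sum(w1 - t)) for t in templates]             # signed l1 distance per class
    c = int(np.argmin(np.abs(p)))                              # most similar template
    templates[c] = templates[c] + k1 * np.sign(w1 - templates[c])   # adjust winning class
    if len(templates) < max_templates and stats["spike_count"] > k2:
        templates.extend([t.copy() for t in templates])        # duplicate existing templates
        stats["spike_count"] = 0
    return c

templates = [np.zeros(32)]        # start from a single template that later splits
stats = {"spike_count": 0}
```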
{{< figure src="/images/phd-thesis/Cdup.svg" title="Figure 54: Illustration of centroid evolution over several iterations." width="500" >}}
## 37 Complexity Evaluation
Generally the application of these methods should reflect a system-level objective. For the configuration used here the memory and algorithmic operations are estimated in Table 7. Multiplications are counted as eight elementary operations, and memory calls are not considered as computation but as load/store cycles. The impression made here is that template matching is strictly very efficient if the memory allows large allocation of active spike waveforms. Similarly, RVD could show a considerable reduction in operations if a dedicated multiplier is introduced, but that depends on how much we value compactness over execution speed. The disparity in memory requirement will dramatically worsen when the number of centroids is increased, which is not the case for the computational complexity of RVD.
Table 7: Estimation on memory and computational resource requirements for each algorithm.
| **Algorithm** | **Memory** | **Operations** | **Cycles per Sample** |
|----|----|----|----|
| NEO Peak Detection | 20 Elements | 30 | 56 |
| RVD / training | 63 Elements | 29 / 88 | 57 / 116 |
| Template / training | 85 Elements | 9 / 16 | 27 / 34 |
{{< figure src="/images/phd-thesis/P05.svg" width="500" >}}
{{< figure src="/images/phd-thesis/D05.svg" title="Figure 55: RVD and template based classification for data sets with \\(-26 dB\\) background activity." width="500" >}}
{{< figure src="/images/phd-thesis/P01.svg" width="500" >}}
{{< figure src="/images/phd-thesis/D01.svg" title="Figure 56: RVD and template based classification for data sets with \\(-20 dB\\) background activity." width="500" >}}
{{< figure src="/images/phd-thesis/P02.svg" width="500" >}}
{{< figure src="/images/phd-thesis/D02.svg" title="Figure 57: RVD and Template based classification for data sets with \\(-16 dB\\) background activity." width="500" >}}
The empirical results in Figure 56 generally show that in moderate noise conditions our classification accuracy is typically better than 85%, calculated in terms of the aggregate probability of correct classification multiplied by the probability of missing a spike event. Unsurprisingly, RVD is not very effective in noisy conditions where the variance accentuates irrelevant components. The classification accuracy from template matching is also shown in Figure 57. These results primarily show an improved noise rejection characteristic, but more generally this approach is more resilient at dealing with false positives. In principle a new cluster will be assigned to a zero-mean template representing the false positives while keeping the other templates intact. Strictly speaking, the detection circuit should be readjusted to favour increased detection of false positives as long as the rate of false negatives remains low, but instead exactly the same parameters are used for every test.
We should be careful in judging the effectiveness of these implementations, particularly with respect to efficiency. While we can generally increase performance by allocating more memory or introducing additional computation, we need to quantitatively evaluate the objective. We suggest normalizing the resource allocation with respect to the increased information extracted from the signal by classifying; that is, how much more processing we allocate for classification in proportion to the increase in the signal-to-noise ratio of our output. In the optimistic case where the three classes of neurons being detected are uncorrelated, our baseline accuracy would be \\(33\%\\) while needing \\(56\\) cycles of operation for spike detection. In fact this leads us to believe both algorithms in this respect decrease resource efficiency by a factor of around \\(2\\), accounting for the increased memory and processing requirement compensated by increased accuracy. While this claim is sceptical of the motivation for classifying spike events, it also justifies the aggressive reduction in algorithmic complexity through the approximations presented here. There is genuine benefit in classification that assists the convergence of further processing algorithms. In addition we simply argue that dedicating resources in excess of those needed for signal conditioning may not be worthwhile. The key point demonstrated here is that these methods appear very much attainable in terms of on-chip processing capacity. Here we considered the case without supervision specifically in consideration of scalability. It is likely that further reductions or optimizations can be made to the structure of these methods to improve accuracy and noise tolerance.
# 38 System Architecture
The conceptual architecture of the system proposed here is foremost based on the opportunity for software-defined real-time instrumentation that has not yet been exploited in chronic implantable systems at this scale. Currently it is commonplace to see synthesized logic that performs all processing and data handling procedures in such a way that they have very limited high-level reconfigurability. This is strictly in order to save power and reduce complexity at the system level. It is important to note that for any recording device there are a multitude of phases during its operation where this flexibility can be highly advantageous once sensor characteristics are learned. As discussed in Section 32, many classification algorithms benefit from training or characterizing the recording conditions.
The approach to specialized DSP in the literature reflects two problems in this field. The first is signal extraction from recordings, which relates to what we have discussed in terms of spike detection to extract compressed spike train data. The other is associated with accelerating adaptive filters that map these spike trains onto estimates of cognitive dynamics or invoked limb movement. Typical examples for spike sorting are fully synthesized cores [^49] [^158] that can be integrated and are capable of achieving respectable processing capacity for specific algorithms. This is in contrast to spike train decoding, which is predominantly performed by FPGAs, as integration makes less sense for developing high-dimensional adaptive filters like [^159] that do not need to be embedded within the body. Interestingly, the work in [^40] proposes an application specific instruction set processor (ASIP) that similarly argues for high performance computation for these structures with a high degree of flexibility, reflecting the different models used for spike encoding. Additionally we see the advantage of using off-chip microcontrollers like the MSP430 that interface with a highly reconfigurable instrumentation front-end to leverage both adaptive and involved noise shaping to perform more intricate operations such as seizure detection or artefact removal [^160]. While these works may not be viable for high channel count implantable devices, they do highlight the considerations for designing fully integrated prosthetics, which is in line with this work.
Here we will consider a particular type of microcontroller topology that supports reconfigurable functionality and reflects the fact that, although multichannel BMIs are highly parallel in nature, the associated processing can also be algorithmically intensive. The feasibility of this notion has been estimated to an extent but many components are subject to implementation. In essence we optimistically approach this design problem with a strategy that exploits both the homogeneity in processing and the information locality in order to realize a feasible solution. This lets us focus on the in-channel operation where efficiency is maximized through the topology of the execution unit. Regardless of the end result, this proposed system will be one step towards the goal of more effective chronic neural implants.
## 39 Distributed System
The system illustrated in Figure 58 represents the distributed microcontroller architecture. The primary mechanism of operation is the program memory that continuously feeds the stored instructions into the pipelined array of processors that operate locally on the recorded data. The execution of these instructions is handled by what is essentially an instruction decoder, a memory module and an arithmetic unit interfaced with four analogue recording channels. This approach guarantees that the absolute minimum amount of energy is required for the communication of recorded data, as the information is processed and consolidated to its elementary components at the quantization interface.
{{< figure src="/images/phd-thesis/Sys_sH.svg" title="Figure 58: Illustration of the proposed distributed \\(\mu\\)C array for homogeneous program execution at the sensor interface." width="500" >}}
Inherently this implementation will sacrifice the availability of the more intricate functionality found in DSPs, since the data is not funnelled into one processing unit that can be very elaborate in complexity. The distributed structure is rationalized by the fact that the intensive operations such as clustering methods operate at a much lower speed due to the sporadic spiking activity that makes statistical convergence slow. Furthermore, these adaptations need to be performed on the order of minutes, by which point such functions may also be implemented through the redundancy of elementary operations. Moreover, multiplexing loses effectiveness in memory intensive applications as it does not mitigate the power & area scaling associated with memory allocation.
Also consider that the program control that gives this implementation its capability for generic computation does not scale with the number of processing units. This is an important distinction when addressing hundreds of channels on chip, which will allow this implementation to outperform other architectures and leverage the fully integrated form factor. We also note that whether this architecture is realized by synthesized logic, FPGA fabric, or more custom logic cells is insignificant to the extent that the memory structure plays a more profound role. This claim is based on the algorithms in Section 32 that allocate significantly more resources to memory than to algorithmic operators. In particular, memory density and efficiency are critical components to the success of this type of large scale sensor system. Here 3-T eDRAM is employed, which is more effective than alternative memory solutions and can still be realized on a standard CMOS process [^161]. When compared to an SRAM equivalent we find it can readily achieve a factor of 8 improvement in density [^162].
{{< figure src="/images/phd-thesis/NPI_TLT2.svg" title="Figure 59: System architecture for NPI sensing platform with digital interfaces annotated." width="500" >}}
The high-level interfaces are illustrated in Figure 59. There are multiple layers with respect to how internal resources are accessed for reconfiguration. This is primarily for robustness, where each layer increases in complexity and chance of failure. The low-speed interface is the simplest element, which acquires commands from an external device with very relaxed requirements on input timing. These commands allow us to reconfigure the high level sub-blocks, for example tuning the generated reference voltages provided by the power management, controlling reset/power of individual sub-blocks and selecting which digital test signals should be monitored. In particular the processor array and program memory layers operate almost in isolation from the peripherals. These blocks are timed by the internal PLL structure that drives significantly higher data rates which do not need to propagate to the pad level, in order to save power. The back-end of the system similarly communicates data uni-directionally between two different clock domains to send data packets off-chip using a number of handshaking protocols.
The implementation of the analogue circuitry has been discussed in Section 23; here we additionally constrain all algorithms to a maximum of 1024 cycles per sample while allocating at most \\(128\\) words of memory. With respect to our previous discussion, this amount of hardware should allow a large set of algorithms that are resource efficient. If not, the topology will promote the construction of processing with more aggressive memory efficiency, using feedback dynamics to implement more complex operators such as division. It should be noted that these specifications have flexibility by sampling multiple times per program cycle or by reducing the system clock using the configurable phase locked loop in order to reduce power.
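As a quick sanity check against the figures in Table 7 (with the assumption that detection and RVD with training run back-to-back in one program pass), the per-channel budget is comfortably met:

```python
budget_cycles, budget_words = 1024, 128
used_cycles = 56 + 116        # NEO peak detection + RVD training path (Table 7)
used_words = 20 + 63          # corresponding memory elements
assert used_cycles <= budget_cycles and used_words <= budget_words
print(f"{used_cycles}/{budget_cycles} cycles, {used_words}/{budget_words} words per sample")
```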
{{< figure src="/images/phd-thesis/Lay_sH.svg" width="500" >}}
{{< figure src="/images/phd-thesis/Lay_sH.png" title="Figure 60: Physical implementation of NPI system using a 6-metal $0.18 \mu m$ CMOS process. " width="500" >}}
Figure 60 presents the fabricated prototype device. It can be seen that integrating many peripheral blocks such as a phase locked loop, voltage supply regulators, and program memory on chip minimizes the pad count required for the digital and power domains. However, even for a 64-channel system the number of analogue pads required for the sensor interface plays a significant role in the top-level organization. In addition, careful consideration has to be made with respect to how the digital signals propagate, where minimizing track length not only reduces digital noise coupled to the substrate but, more significantly, the associated power dissipation. The number of processing elements can in fact quite easily be scaled up by extending the instruction pipeline, where the system-level timing constraint for speed and fanout lies with the program memory, which has an internal pipeline that ties the program memory blocks together.
## 40 Processing Core
In order to allow the hardware to provide generic processing capabilities in a distributed fashion, a number of considerations have to be made. In particular we need to reflect the typical operations with certain modalities of operation. It is clear that although all recording channels should execute the same algorithm they will typically not share the same state of operation. This state dependency is exemplified by intermittent processing during bursting neural activity and idling during quiet periods. This is an inherent limitation to sharing the program memory, as dynamic execution of the code, where each core has its own program counter or a top-level scheduler, is not feasible for an arbitrary number of channels. The quasi-out-of-order execution makes it challenging for us to adopt the scalable tile structures found in image processing [^105] that excel in maximizing area and power efficiency in a scalable sense.
The design is instead led by maximizing the locality of data execution [^163], where branch control or conditional execution is mediated by skipping a section of the incoming instructions if a condition is not met. Skipping sections of code upon branching is relatively inefficient with respect to throughput. However this approach is optimal at the system level when individual cores may need to execute any section of code, and branching is then only limited by the dissipation of the registers pipelining the instructions across the chip.
{{< figure src="/images/phd-thesis/Sys_uC.svg" title="Figure 61: Organization of the distributed execution unit detailing components and the interconnect." width="500" >}}
The individual components of the execution unit are shown in Figure 61 together with the main data buses used for exchanging data. The majority of operations revolve around manipulating data in the registers R1-R16 as the A operand in association with any other data source that can be used as the B operand. The operation performed by the arithmetic logic unit (ALU) will always overwrite the result to the location of the A operand but can in addition also write to other locations (i.e. memory, periphery, etc.). This implies that in terms of instructions there are always two components, where the first is simply the operation executed by the ALU together with the two operand sources. The second component optionally extends this simple functionality by writing the intermediate values to multiple other locations or by performing arbitrary branching operations that will take the unit out of sleep.
On that note we mention that the local execution controller consists of three registers that assist in branching operations or conditional execution. When any of these registers holds a logic one the instruction is gated by a null operation before execution. One of these registers self-resets, allowing for if-else functionality by skipping a single instruction. The two other registers need to be cleared actively, but in combination this allows for nested conditioning of up to three levels. While in the idle state no internal registers are clocked with the exception of the instruction pipeline and the branch controller, saving a significant amount of power as the instruction does not need to be decoded.
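To make the skip-based conditional execution concrete, the sketch below models a simplified branch controller in Python: any set branch register gates the next instruction with a null operation, and the first register clears itself after one skip. The class and method names are illustrative assumptions; only the three-register behaviour described above is taken from the design.

```python
class BranchController:
    """Simplified model of the per-core branch controller (illustrative only).

    BR1 self-resets after gating a single instruction (if-else style skips),
    while BR2 and BR3 must be cleared explicitly, giving up to three levels
    of nested conditional execution.
    """

    def __init__(self):
        self.br = {1: 0, 2: 0, 3: 0}       # branch registers, logic 1 = skip

    def set_branch(self, index, value):
        self.br[index] = 1 if value else 0

    def gate(self):
        """Return True if the next instruction should be replaced by a NOP."""
        skip = any(self.br.values())
        if self.br[1]:                     # BR1 clears itself after one skip
            self.br[1] = 0
        return skip


# Example: skip exactly one instruction when a compare fails.
ctrl = BranchController()
ctrl.set_branch(1, True)                   # e.g. result of a failed compare
trace = [ctrl.gate() for _ in range(3)]
print(trace)                               # [True, False, False]
```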
The digital data interface provides the means for communicating data either off chip or to adjacent execution units. This functionality allows granular consolidation of features or signal structure and lets measurements be correlated with system-level parameters. For example, if each execution unit is listening to its most informative analogue instrumentation channel, it is conceivable to compare its spike train with that of an adjacent unit to evaluate neural interconnect-level features. The asynchronous data bus, on the other hand, is a key feature that allows this system to appear as a slave at the network level without needing to be coherent with the system or off-chip clock. This bus is in essence a large buffer distributed across many channels, utilising asynchronous hand-shake protocols to funnel the data to an SPI module that is clocked either externally or internally [^164]. This solves a number of coherence problems and mitigates the need for an FPGA to drive this system, as the SPI module is not timing critical. Furthermore this alleviates clock distribution, as the timing constraints are always local to each execution unit and not to a data bus distributed across the chip, which would be either very restricting or power intensive.
Dynamic control of the analogue channel is enabled by one designated 8 bit register per analogue channel. In this particular case 4 bits are used to specify gain, 2 bits configure the biasing current as 0x/1x/2x/3x, and 1 bit controls the reset function. In particular the reset phase will temporarily boost the transconductance of the band-limited filtering stage to allow a sub-microsecond auto-zero for active noise shaping. For both the ADC and the amplifiers there is one bit that controls a multiplexer at the input which can switch in the sensor or a global differential test net for calibration or verification. Similarly the ADC has 2 bits to select which analogue channel is converted and another bit to clock it at the full rate or half the rate of the adjoint micro-controller. In addition there are 3 bits that control how the chopper frequency is divided down from the sampling signal, which is itself the final control bit. Understandably the analogue configuration will remain static after being set appropriately. The ADC configuration register is considerably more dynamic as the multiplexer needs to be reconfigured and sampling needs to toggle persistently.
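As an illustration of how such a configuration word might be assembled in software, the following sketch packs the amplifier fields described above (4-bit gain, 2-bit bias scaling, reset and input-multiplexer bits) into one byte. The field widths follow the text, but the bit ordering and the function name are assumptions.

```python
def pack_channel_config(gain, bias, reset, test_mux):
    """Pack the per-channel analogue configuration into one 8-bit word.

    Field widths follow the text (4-bit gain, 2-bit bias scaling, 1-bit reset,
    1-bit input multiplexer); the bit ordering used here is an assumption.
    """
    assert 0 <= gain < 16 and 0 <= bias < 4
    word = (gain & 0xF)                    # bits [3:0]  amplifier gain setting
    word |= (bias & 0x3) << 4              # bits [5:4]  bias current 0x/1x/2x/3x
    word |= (1 if reset else 0) << 6       # bit  [6]    fast auto-zero reset
    word |= (1 if test_mux else 0) << 7    # bit  [7]    sensor / test-net select
    return word


print(hex(pack_channel_config(gain=9, bias=2, reset=False, test_mux=True)))  # 0xa9
```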
There are two modes of getting quantized data from the ADC depending on the desired functionality. The first is simply reading the 8 bit quantization register that shadows the 7 bits quantized by successive approximation and the LSB from the first integration result. In order to utilize the higher resolution capability the comparator output is used to integrate coefficients from the instruction onto a local register, where the comparator will decrement or increment the register accordingly. If no calibration data is stored locally this operation first integrates binary weights onto one register during the SAR cycles and then integrates the FIR window onto another register. This is a large investment of cycles to perform high resolution quantization but it can be optimized for specific applications when necessary. If calibration data is available for the 7 SAR weights then the ADC must be configured to run at half the system speed and, before quantization, these weights are loaded from the memory onto registers R2-R7. This is followed by the usual process of SAR quantization while these weights are simultaneously also integrated onto a second register. After the integration phase three registers will then contain quantized data. The scaling of coefficients is key and should be such that the $\Sigma \Delta$ result simply copies the sign bit of the SAR operation and its lower 7 bits can be concatenated with the SAR result. The calibration data is then scaled appropriately and added to the 14 bit signed double word with carry logic. Clearly there are a number of conventions suggested here that will best exploit the capabilities of the design.
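A simplified software model of the recombination step is sketched below: the 7-bit SAR code and the sigma-delta count are concatenated into a 14-bit signed word and per-bit calibration corrections are added for the asserted SAR bits. The exact scaling and sign conventions used on chip are not reproduced here; the function interface and example values are assumptions for illustration only.

```python
def combine_sar_sigma_delta(sar_code, sd_count, cal_weights=None):
    """Illustrative recombination of the two quantisation phases.

    sar_code    : 7-bit result of the successive-approximation phase.
    sd_count    : signed count accumulated from the comparator during the
                  oversampled sigma-delta phase, assumed to span 7 bits.
    cal_weights : optional per-bit corrections for the 7 SAR capacitor weights
                  (placeholder values, not measured calibration data).
    """
    word = (sar_code << 7) | (sd_count & 0x7F)        # concatenate 7 + 7 bits
    if cal_weights is not None:
        for bit in range(7):                          # add correction of each
            if (sar_code >> bit) & 1:                 # asserted SAR bit
                word += cal_weights[bit]
    word &= 0x3FFF                                    # interpret as 14-bit signed
    return word - 0x4000 if word & 0x2000 else word


print(combine_sar_sigma_delta(0x55, 0x12))                           # uncalibrated
print(combine_sar_sigma_delta(0x55, 0x12, [1, 0, -1, 2, 0, 0, -2]))  # calibrated
```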
The memory module local to each execution unit holds 128 words of data which can be shared across the analogue channels with 32 locations each. Particularly when the DSP is mainly performing filtering, the recorded data can be buffered for FIR filtering or the unit can keep its high precision filter state variables for IIR structures. The filter and program coefficients are stored in the shared program memory such that the execution unit does not experience an overhead in memory requirement. However for other memory intensive algorithms such as template matching, serving the most informative of the four analogue channels will have to suffice because the memory requirement is beyond the capabilities of this configuration. The DRAM architecture has a refresh-upon-read mechanism which implies that the used memory locations have to be read systematically to keep the stored data valid. Fortunately this requirement is self fulfilling as the program recycles itself every $100 \mu s$ and the DRAM retention time is on the order of $1 ms$, implying that as long as there is a guaranteed read on the memory location it will stay valid. The physical read mechanism however does require a minimum of two cycles. The first is in the background, which simply prepares the internal registers of the module while a different execution is taking place, and the second is in the foreground where the location is read and the data bus is driven by the DRAM.
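Because correctness of the stored data hinges on every used location being read within the retention time, a simple timing check of the kind below can be applied to a candidate program schedule. The $100 \mu s$ program cycle and $1 ms$ retention figure are taken from the text; everything else is an illustrative assumption.

```python
def refresh_ok(read_times_us, retention_us=1000.0):
    """Check that successive reads of one DRAM location always fall within the
    retention time, so the refresh-upon-read mechanism keeps the data valid."""
    gaps = [b - a for a, b in zip(read_times_us, read_times_us[1:])]
    return all(gap < retention_us for gap in gaps)


# A location read once per 100 us program cycle is refreshed well within 1 ms.
print(refresh_ok([i * 100.0 for i in range(20)]))   # True
# A location skipped for 15 program cycles would exceed the retention time.
print(refresh_ok([0.0, 1500.0]))                    # False
```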
{{< figure src="/images/phd-thesis/Lay_uC.svg" width="500" >}}
{{< figure src="/images/phd-thesis/uCm.png" title="Figure 62: Physical implementation of execution unit using a 6-metal $0.18 \mu m$ CMOS process " width="500" >}}
As the illustration in Figure 62 shows, keeping the 8 bit structure in terms of parallel operations maintains a very compact floor plan. This is typical of data-flow intensive designs where the digital logic should be placed underneath the associated data buses. This is difficult to replicate with automated synthesis tools where signal congestion is the most stringent aspect. The digital signals for the two operands and the data line span horizontally, and each sub-block extensively takes advantage of gated output buffers controlled by the decoders. The full custom approach taken here trades design effort for additional performance in terms of reduced parasitics and more aggressive power gating.
$$ \langle C \rangle, [\langle CE \rangle], \langle A \rangle, \langle B \rangle, [\langle OE \rangle^{*}] $$
The syntax for constructing instructions follows the Backus Normal Form [^165] formatted in Equation 35, with reference to Table 8 which summarises all possible compositions; a minimal parser sketch for this syntax is given after the table. A parser is implemented that translates an ordered set of these instructions directly into the hardware-specific machine code that is fed into the instruction pipeline; any violations or exceptions are caught by this script automatically. Although there is no dedicated multiplication hardware there are specialized registers that allow shift-add based multiplication over eight cycles. Any other primitive logical or arithmetic function can be realized with this instruction set as it is Turing-complete. This assertion is made by noting that it can evaluate the operation 'subtract and branch if less than or equal to zero', which is sufficient for a one instruction set computer [^166].
Table 8: Overview of instruction sub-components.
| **Index** | **Operation Subset** | **Summary of Possible Entries** |
|----|----|----|
| C | Logic Operation: | Logical Shift Left/Right, Arithmetic Shift Left/Right, XOR, XNOR, AND, OR, MOVE-A, MOVE-B |
| C | Arith. Operation: | Compare, Add, Carry Add, Multiply, Complete Multiply |
| CE | Compare Option: | >, =, <, Overflow |
| CE | Add Option: | Subtract, Absolute Value, Increment Overflow Bit |
| CE | Mov Option: | Mem. Address is from Data line. Default is from Instruction |
| A | Operand A: | R1-R8, R9-R13, ID, Count, Memory |
| B | Operand B: | R1-R8, Left uC, Right uC, Instruction, ADC, Memory, Null |
| OE | IO Extension: | Write to Left uC, Write to Right uC, ADC Sample enable, Write Output Buffer |
| OE | Branch Extension: | Write to Branch Register BR1-BR3, Invert Branch Result |
| OE | Memory Extension: | Write Address, Write Data, Read Data |
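As referenced above, a minimal sketch of such a parser is given below. It only splits and validates the sub-components of Equation 35 using truncated mnemonic sets derived from Table 8 and Table 10; the real assembler additionally emits the 32-bit machine encoding and hardware-specific exception checks, which are omitted here, and the exact mnemonics are assumptions.

```python
# Minimal sketch of an instruction parser for the <C>,[<CE>],<A>,<B>,[<OE>*]
# syntax of Equation 35.  The mnemonic sets are truncated placeholders.
OPERATIONS = {"ADD", "CADD", "CMP", "MULT", "AND", "OR", "XOR", "XNOR",
              "LSL", "LSR", "ASL", "ASR", "MOVA", "MOVB"}
OPTIONS    = {"GT", "EQ", "LT", "OVF", "SUB", "ABS", "INC"}
OPERANDS_A = {f"R{i}" for i in range(1, 14)} | {"ID", "COUNT", "MEM"}
OPERANDS_B = {f"R{i}" for i in range(1, 9)} | {"LUC", "RUC", "DINST", "ADC", "MEM", "NULL"}

def parse(line):
    """Split one instruction into its sub-components and validate each field."""
    fields = [f.strip() for f in line.replace(",", " ").split()]
    op = fields.pop(0)
    if op not in OPERATIONS:
        raise ValueError(f"unknown operation {op!r}")
    option = fields.pop(0) if fields and fields[0] in OPTIONS else None
    if len(fields) < 2:
        raise ValueError("missing operands")
    a, b = fields.pop(0), fields.pop(0)
    if a not in OPERANDS_A or b not in OPERANDS_B:
        raise ValueError(f"bad operand pair {a!r}, {b!r}")
    # immediate literal only follows a DINST (data-from-instruction) operand
    literal = fields.pop(0) if b == "DINST" and fields and fields[0].lstrip("-").isdigit() else None
    return {"op": op, "option": option, "a": a, "b": b,
            "literal": literal, "extensions": fields}

print(parse("MOVB R5 DINST 26"))       # example taken from Table 10
print(parse("CMP GT R3 R5 BR1"))       # compare with a branch-register extension
```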
It should be mentioned that there are a number of hardware-specific details with respect to how certain instructions behave that need careful consideration in the implementation. For example, if no comparison is made but a branch register is accessed, the output of the comparator is treated as false regardless of the state of the overflow bit. This allows us to clear or set branch registers while simultaneously performing an operation. Another example is that by default the instruction data is presented at the input of the memory address to prepare a read in the background. In most cases the behaviour is intuitive and we simply strive to maximize the cycle efficiency. At all times the execution unit is capable of dealing with the compute aspects while performing branching and memory access simultaneously.
This work also provides an elaborate set of test tools that allows compilation of instruction code and the generation of piece-wise-linear 'csv' files for test sources that can be used in circuit simulators. These can be used in association with either the transistor-level or the Verilog implementation of the processing core. The behavioural models in particular are important for translating this architecture to other implementations.
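A sketch of how such a piece-wise-linear test source could be generated is shown below, writing time/voltage pairs for a single digital line to a 'csv' file. The 50 ns period and 1.2 V swing correspond to the 20 MHz, 1.2 V system listed in Table 9; the file layout and edge time are assumptions.

```python
import csv

def write_pwl(path, bits, period=50e-9, vdd=1.2, t_edge=1e-9):
    """Write a piecewise-linear voltage source for one digital line.

    Each bit is held for one clock period with a short linear edge between
    levels, producing time/voltage pairs a circuit simulator can read as a
    PWL source.
    """
    points, t, prev = [(0.0, bits[0] * vdd)], 0.0, bits[0]
    for b in bits[1:]:
        t += period
        if b != prev:
            points.append((t, prev * vdd))          # hold the previous level
            points.append((t + t_edge, b * vdd))    # linear transition edge
            prev = b
    points.append((t + period, prev * vdd))         # final hold
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(points)

# One machine-code word from Table 10 driven onto a single instruction line.
write_pwl("mova_r3.csv", [int(c) for c in "0011011000000000011010011101111"])
```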
{{< figure src="/images/phd-thesis/uC_PS.svg" title="Figure 63: Power dissipation with respect to specific operations for the same operand A=113 & B=114 in randomized order." width="500" >}}
The results in Figure 63 exemplify the dependency of power dissipation on the different operators for the same operands A and B. A strong operand dependency of the power consumption should also be expected, but these results follow our expectations closely. Generally, the simpler the operation the lower the current dissipation, because less complexity is involved in the switching losses. Here again we observe that when the unit is in a sleep or branching state the power dissipation is mainly associated with the instruction pipeline. As this 32 bit pipeline traverses the entire execution unit it makes a significant contribution to the baseline power consumption. The typical current consumption at full activity will lie around $45 \mu A$; it should be noted that sporadic spiking activity will gate the majority of operations and it is likely that running at half the designed rate with 512 cycles is more than sufficient. Note the typical power figure of \\(2.7 pJ/Cycle\\) or $2.7 \mu W/MIPS$, which is several orders of magnitude better than 16-bit microcontrollers such as the MSP-430 [^167].
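As a consistency check, the quoted energy per cycle follows directly from this current draw when taking the 1.2 V supply and 20 MHz clock listed in Table 9:

$$ E_{cycle} = \frac{I \cdot V_{DD}}{f_{clk}} = \frac{45 \mu A \times 1.2 V}{20 MHz} = 2.7 pJ/Cycle $$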
Table 9: Summary of performance specifications for the NPI system and state-of-the-art specialized integrated processing architectures. \\(^\star\\) Reconfigurable topology.
| Parameter | Unit | This Work | Markovic [^168] | Arimoto [^105]|
|----|----|----|----|----|
| Architecture | | Distributed $\mu$C Array | Multi-Grain FPGA | Dedicated Tile Array |
| Technology | [nm] | 180 | 40 | 65 |
| Supply Voltage | [V] | 1.2 | 1 | 1.2 |
| Parallel Units | | 64 | 16$^\star$ | 2048 |
| Instruction Size | [bits] | 32 | - | 32 |
| Operational Frequency | [MHz] | 20 | 400 | 300 |
| Sampling Frequency | [S/s] | 32k | 100M | - |
| Operations per Sample | [Cycles] | 256 | 4 | - |
| $P_{Digital}$ per Channel | [$\mu$A] | 44 | - | - |
| $P_{Analogue}$ per Channel | [$\mu$A] | 16 | - | - |
| System Power | [mA] | 1.42 | 11.6 | 300 |
| Program Memory Capacity | [kb] | 32 | - | - |
| Processor Memory Capacity | [kb] | 1 | 36 | 1 |
| Processor Array Area | [mm$^2$] | $1.04 \times 1.32$ | $3.8 \times 5.4$ | $1.60 \times 3.19$ |
| Power Efficiency | [GOPS/mW] | 1.52 | 0.86 | 0.31 |
| Area Efficiency | [GOPS/mm$^2$] | 0.88 | 2.34 | 36.1 |
The specifications given in Table 9 summarize the main features of this system on chip for processing neural data at the sensor interface. As the total power consumption is on the order of \\(1.5 mW\\) there is some concern with respect to the power density of the system in full operation, which in this particular case is \\(26 mW/cm^2\\). In fact if the number of channels is scaled up beyond 64 channels this power density will tend towards \\(29 mW/cm^2\\) but will not exceed it. Either figure will likely be smaller subject to the physical & software implementations but, more importantly, will not result in thermal agitation or heating of cortical tissue exceeding \\(2^{\circ}C\\) [^68]. More generally we have the advantage of tuning processing capabilities to the heat capacity of the implanted package. In fact, comparing this work to state-of-the-art FPGA topologies [^168] and highly parallel ASIC structures [^105] that follow the same design methodology, we find power and area efficiency that exceed those of stand-alone microprocessors by orders of magnitude. These figures also reflect the expectation that technology scaling should lead to even more compact configurations. In addition, gate leakage may introduce some diminishing returns with respect to power efficiency. We mention that these figures are extrapolated based on the performance of a single execution unit and we expect additional overhead from other components that is not accounted for in this comparison.
$$ R_{D} = \frac{P_{\mu C} \cdot A_{\mu C}}{N^{2}_{chan} \cdot Cycles} = \frac{44 \mu W \cdot 196 \times 158 \mu m^2} {4^2 \cdot 256}\approx 3.3 \cdot 10^{-16} \: \left[W mm^2 \: per \: OP \right] $$
Re-evaluating our power/area figure of merit in Figure 49 with Equation 36 we observe that in practice we lose a factor of ten in efficiency when compared to a dedicated ASIC implementation because resource utilization inside the execution unit cannot be maximized. This was expected given that we attain high-level reconfigurability instead. However this does give a good understanding of how the system scales from this point, both with respect to area and power requirements.
## 41 Testing Platform
This system is directed at generic use by the neuroscience community, where high level programming and interfaces are essential for end user adoption. The testing platform presented here is arranged such that its fundamental components can be extended greatly to serve a multitude of needs. This ambitious design criterion is primarily met by the real-time platform illustrated in Figure 64 that supports a standard Linux operating system. The three components comprise the custom NPI system on chip, the Raspberry Pi platform, and networked resources.
{{< figure src="/images/phd-thesis/Sys_iP.svg" title="Figure 64: Block diagram of the instrumentation platform developed as framework for real-time applications." width="500" >}}
The software stack running on the Raspberry Pi primarily handles the high speed SPI link that fetches data from the NPI system at \\(10 Mb/s\\) and stores it to a local buffer for some of the data visualization. This data stream is then forwarded to a network routine that is connected to a server over the local area network via a UDP protocol to allow large quantities of data to be stored in a scalable fashion. The graphical user interface is built on top of this process in order to give a means to both configure the device actively and provide some form of interactive interrogation with respect to the recorded data and the algorithm being executed.
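A minimal sketch of this forwarding path is given below, assuming the common spidev and socket Python interfaces on the Raspberry Pi; the server address, frame size and SPI bus numbers are placeholders rather than the values used by the actual software stack.

```python
import socket
import spidev

SERVER = ("192.168.1.10", 5005)      # placeholder LAN address of the storage server
FRAME_BYTES = 1024                   # placeholder frame size

spi = spidev.SpiDev()
spi.open(0, 0)                       # bus 0, chip-select 0 (board dependent)
spi.max_speed_hz = 10_000_000        # 10 Mb/s link quoted in the text

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def forward_frames(n_frames):
    """Fetch raw frames from the NPI over SPI and forward them over UDP."""
    for _ in range(n_frames):
        frame = bytes(spi.readbytes(FRAME_BYTES))
        sock.sendto(frame, SERVER)   # fire-and-forget, matching the UDP choice

if __name__ == "__main__":
    forward_frames(100)
```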
The application of a generic internet-of-things platform plays an important role with respect to long term development objectives. It signifies that the ASIC is there to provide a specialized interface with the sensor and a generic digital interface with the external controller, allowing rapid adoption of new techniques or other components as software extensions. This substantiates the modular approach where design effort is explicitly focused on specialized hardware for the sensor and software development at the system level. This is important given the complexity of these systems, where over-specialization limits the versatility of existing designs and thereby the utility of other commercially available tools/devices.
The advantage here is that a multitude of procedures, detailed in high-level programming code with fast development and turn-around capabilities, can be run on the real-time platform without supervision. In this case it significantly improved test procedures by enabling automated exhaustive characterization of logical integrity. In fact the standalone module of the microcontroller structure can run 1 MIPS of on-the-fly randomly generated operations. This can be seen in Figure 65 where the Saleae logic analyser is used to probe the internal data bus of one particular core.
{{< figure src="/images/phd-thesis/Scope.png" title="Figure 65: Digital waveform of the internal data bus BIT 1-8 as new instructions are being loaded into the device using the clocked Latch and Configure signals." width="500" >}}
Table 10: Section of Instructions and recorded outputs from $\mu C$ structure with the associated machine code.
| **BITLINE**| **INSTRUCTION** | **Machine Code** |
|----|----|----|
| 00011010 | MOVB R5 DINST 26 |0011111100000000011000000011010|
| 11101111 | MOVA R3 DINST -17 |0011011000000000011010011101111|
| 00001010 | AND R5 R3 |1110111100000000000000000000000|
| 00011010 | MOVB R5 DINST 26 |0011111100000000011000000011010|
| 11101111 | MOVA R3 DINST -17 |0011011000000000011010011101111|
| 11111111 | OR R5 R3 |1110111100000000000010000000000|
| 00011010 | MOVB R5 DINST 26 |0011111100000000011000000011010|
| 11101111 | MOVA R3 DINST -17 |0011011000000000011010011101111|
| 11110101 | XOR R5 R3 |1110111100000000000100000000000|
This is partly shown in Table 10 where the internal bit-line of one such execution unit could be directly accessed. Because it is not viable for us to exhaustively simulate the hardware in all conditions, we use a physical test bench to record the performance tolerance with respect to voltage supply and operating frequency. Moreover what the user sees is reduced to latent frames of data over several milliseconds and the corresponding instruction code executed by the platform; the physical interfacing protocols are largely transparent. By construction each core has a hard-wired ID that allows active supervision of internal variables for development and debugging of single units. Due to the specialized hardware the instrumentation programs currently still require careful tailoring of the instruction code, but this can be extended towards compiling directly from the C++ code that is also used to construct the rest of the platform.
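A hedged sketch of the kind of randomized cross-check driven from the platform is shown below: random operand pairs are pushed through a small golden model of the 8-bit ALU and compared with values read back from the core. The operation subset and the read-back hook are placeholders; in the sketch the hook simply mirrors the model, so the regression passes by construction.

```python
import random

# Golden software model of a few 8-bit ALU operations used to cross-check the
# hardware; the operation subset and the read-back hook are placeholders.
GOLDEN = {
    "AND": lambda a, b: a & b,
    "OR":  lambda a, b: a | b,
    "XOR": lambda a, b: a ^ b,
    "ADD": lambda a, b: (a + b) & 0xFF,   # 8-bit wrap-around
}

def read_back_from_core(op, a, b):
    """Placeholder for probing the internal data bus (e.g. via the logic
    analyser or the SPI read-back path); here it just mirrors the model."""
    return GOLDEN[op](a, b)

def random_regression(n_tests, seed=0):
    rng = random.Random(seed)
    failures = 0
    for _ in range(n_tests):
        op = rng.choice(list(GOLDEN))
        a, b = rng.randrange(256), rng.randrange(256)
        if read_back_from_core(op, a, b) != GOLDEN[op](a, b):
            failures += 1
    return failures

print(random_regression(10_000))   # 0 failures expected against the model
```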
{{< figure src="/images/phd-thesis/TPlat.svg" title="Figure 66: Graphical user interface used for configuring the NPI system showing test data." width="500" >}}
Figure 66 depicts the GPU accelerated graphical set-up used for testing the device, where the functionality is mainly associated with reconfiguring and powering different system sub-blocks for validation. From an engineering point of view it is a convenience to have automated reconfiguration of the device as one interacts with the various settings, particularly in association with probing the supply voltages or analogue reference signals generated on chip. During experimentation it would be more typical for this functionality to be reduced to simply selecting from a set of predetermined programs.
{{< figure src="/images/phd-thesis/TPhw.svg" title="Figure 67: Test bed used for characterization with various components illustrated." width="500" >}}
In order to move towards fully isolated operation, as will be the case for an implanted device, the system-on-chip architecture relies on a minimum number of off-chip components to bring the resource requirements of the topology into scope. This is shown in Figure 67. These feasibility considerations are generally based on reasonable assumptions for a wireless implant that is hermetically sealed. In this particular case we allow a number of off-chip decoupling capacitors, a reference resistor and a reference voltage, which may very well be integrated on chip in one way or another if necessary. The system also uses a \\(1 MHz\\) external clock reference, which may be realized at the wireless power carrier frequency and is locked onto with a phase locked loop to generate the internal \\(20 MHz\\) system clock. Three linear LDOs were integrated to provide a \\(1.2 V\\) supply to the digital, analogue, and memory domains separately, with the analogue supply voltage used to derive the internal ADC voltage references of \\(1.2 V\\), \\(0.9 V\\), \\(0.6 V\\), and \\(0.3 V\\), buffered from the unregulated supply using high speed buffers.
# 42 Conclusion
This chapter substantiates a scalable and long-term approach for the development of programmable neural interfaces. In particular we discuss why moving away from the fixed-purpose DSP architectures seen in many conventional systems is significant with respect to performance and reliability. In addition we provide indicators showing that, in the majority of modern CMOS technologies, it is viable to perform local signal analysis using dedicated on-chip processing hardware. Furthermore we highlight the importance of efficient algorithm construction, where the number of operations per sample should be minimized and processing structures should improve scalability for systems with many recording channels, in association with the near-data-processing paradigm. PCA & template matching methods are proposed for embedded systems that require 57 operations per sample and 680 bits of memory with entirely unsupervised operation that can achieve over 80% accuracy during spike detection & classification.
A distributed micro-controller structure is proposed in an effort to realize these characteristics and reveal the underlying constraints. The topology reflects the nature of processing neural data in the context of achieving generic computational capacity. This discussion details both low-level and system-level considerations that address the software stack. The impact of the memory requirement that results from being able to execute arbitrary algorithms in isolation is evident both at the channel and at the chip level. In the proposed configuration the amount of resources allocated for this function is comparable to that of the signal processing, but depends very much on the number of channels that are integrated together. We point out that if the number of channels is increased this component does not change, which allows this topology to become more effective. The distributed processing architecture operates with an efficiency of 1.52 GOPS/mW and each core only requires a 0.02 mm\\(^2\\) silicon footprint with fully reconfigurable 8 bit processing capabilities.
The foregoing discussion has depicted the intricate complexity associated with these sensing systems and revealed the diversity of aspects that should be taken into consideration. Sustainable development for these systems will need long-term solutions due to the excessive design effort that prevents rapid turn around and progress. Moreover innovation needs to be contextualized at the system level to ascertain whether new techniques and methods have significant impact. This requires the abstraction and modelling of these implementations to gauge impact using empirical indicators.
# References:
[^1]: R.Q. Quiroga, Z.Nadasdy, and Y.Ben-Shaul, ''Unsupervised spike detection and sorting with wavelets and superparamagnetic clustering,'' Neural Computation, vol.16, pp. 1661--1687, April 2004. [Online]: http://dx.doi.org/10.1162/089976604774201631
[^2]: R.A. Normann, ''Technology insight: future neuroprosthetic therapies for disorders of the nervous system,'' Nature Clinical Practice Neurology, vol.3, pp. 444--452, August 2007. [Online]: http://dx.doi.org/10.1038/ncpneuro0556
[^3]: K.Birmingham, V.Gradinaru, P.Anikeeva, W.M. Grill, V.Pikov, B.McLaughlin, P.Pasricha, D.Weber, K.Ludwig, and K.Famm, ''Bioelectronic medicines: a research roadmap,'' Nature Reviews Drug Discovery, vol.13, pp. 399--400, May 2014. [Online]: http://dx.doi.org/10.1038/nrd4351
[^4]: ''Bridging the bio-electronic divide,'' Defense Advanced Research Projects Agency, Arlington, Texas, January 2016. [Online]: http://www.darpa.mil/news-events/2015-01-19
[^5]: G.Fritsch and E.Hitzig, ''Über die elektrische Erregbarkeit des Grosshirns,'' Archiv für Anatomie, Physiologie und Wissenschaftliche Medicin, vol.37, pp. 300--332, 1870.
[^6]: G.E. Loeb, ''Cochlear prosthetics,'' Annual Review of Neuroscience, vol.13, no.1, pp. 357--371, 1990, pMID: 2183680. [Online]: http://dx.doi.org/10.1146/annurev.ne.13.030190.002041
[^7]: ''Annual update bcig uk cochlear implant provision,'' British Cochlear Implant Group, London WC1X 8EE, UK, pp. 1--2, March 2015. [Online]: http://www.bcig.org.uk/wp-content/uploads/2015/12/CI-activity-2015.pdf
[^8]: M.Alexander, ''Neuro-numbers,'' Association of British Neurologists (ABN), London SW9 6WY, UK, pp. 1--12, April 2003. [Online]: http://www.neural.org.uk/store/assets/files/20/original/NeuroNumbers.pdf
[^9]: A.Jackson and J.B. Zimmermann, ''Neural interfaces for the brain and spinal cord — restoring motor function,'' Nature Reviews Neurology, vol.8, pp. 690--699, December 2012. [Online]: http://dx.doi.org/10.1038/nrneurol.2012.219
[^10]: M.Gilliaux, A.Renders, D.Dispa, D.Holvoet, J.Sapin, B.Dehez, C.Detrembleur, T.M. Lejeune, and G.Stoquart, ''Upper limb robot-assisted therapy in cerebral palsy: A single-blind randomized controlled trial,'' Neurorehabilitation and Neural Repair, vol.29, no.2, pp. 183--192, February 2015. [Online]: http://nnr.sagepub.com/content/29/2/183.abstract
[^11]: P.Osten and T.W. Margrie, ''Mapping brain circuitry with a light microscope,'' Nature Methods, vol.10, pp. 515--523, June 2013. [Online]: http://dx.doi.org/10.1038/nmeth.2477
[^12]: S.M. Gomez-Amaya, M.F. Barbe, W.C. deGroat, J.M. Brown, G.F. Tuite, J.Corcos, S.B. Fecho, A.S. Braverman, and M.R. Ruggieri Sr, ''Neural reconstruction methods of restoring bladder function,'' Nature Reviews Urology, vol.12, pp. 100--118, February 2015. [Online]: http://dx.doi.org/10.1038/nrurol.2015.4
[^13]: H.Yu, W.Xiong, H.Zhang, W.Wang, and Z.Li, ''A parylene self-locking cuff electrode for peripheral nerve stimulation and recording,'' IEEE/ASME Journal of Microelectromechanical Systems, vol.23, no.5, pp. 1025--1035, Oct 2014. [Online]: http://dx.doi.org/10.1109/JMEMS.2014.2333733
[^14]: J.S. Ho, S.Kim, and A.S.Y. Poon, ''Midfield wireless powering for implantable systems,'' Proceedings of the IEEE, vol. 101, no.6, pp. 1369--1378, June 2013. [Online]: http://dx.doi.org/10.1109/JPROC.2013.2251851
[^15]: R.D. KEYNES, ''Excitable membranes,'' Nature, vol. 239, pp. 29--32, September 1972. [Online]: http://dx.doi.org/10.1038/239029a0
[^16]: A.D. Grosmark and G.Buzs\'aki, ''Diversity in neural firing dynamics supports both rigid and learned hippocampal sequences,'' Science, vol. 351, no. 6280, pp. 1440--1443, March 2016. [Online]: http://science.sciencemag.org/content/351/6280/1440
[^17]: B.Sakmann and E.Neher, ''Patch clamp techniques for studying ionic channels in excitable membranes,'' Annual Review of Physiology, vol.46, no.1, pp. 455--472, October 1984, pMID: 6143532. [Online]: http://dx.doi.org/10.1146/annurev.ph.46.030184.002323
[^18]: M.P. Ward, P.Rajdev, C.Ellison, and P.P. Irazoqui, ''Toward a comparison of microelectrodes for acute and chronic recordings,'' Brain Research, vol. 1282, pp. 183 -- 200, July 2009. [Online]: http://www.sciencedirect.com/science/article/pii/S0006899309010841
[^19]: J.E.B. Randles, ''Kinetics of rapid electrode reactions,'' Discuss. Faraday Soc., vol.1, pp. 11--19, 1947. [Online]: http://dx.doi.org/10.1039/DF9470100011
[^20]: M.E. Spira and A.Hai, ''Multi-electrode array technologies for neuroscience and cardiology,'' Nature Nanotechnology, vol.8, pp. 83 -- 94, February 2013. [Online]: http://dx.doi.org/10.1038/nnano.2012.265
[^21]: G.E. Moore, ''Cramming more components onto integrated circuits,'' Proceedings of the IEEE, vol.86, no.1, pp. 82--85, January 1998. [Online]: http://dx.doi.org/10.1109/JPROC.1998.658762
[^22]: I.Ferain, C.A. Colinge, and J.-P. Colinge, ''Multigate transistors as the future of classical metal-oxide-semiconductor field-effect transistors,'' Nature, vol. 479, pp. 310--316, November 2011. [Online]: http://dx.doi.org/10.1038/nature10676
[^23]: I.H. Stevenson and K.P. Kording, ''How advances in neural recording affect data analysis,'' Nature neuroscience, vol.14, no.2, pp. 139--142, February 2011. [Online]: http://dx.doi.org/10.1038/nn.2731
[^24]: C.Thomas, P.Springer, G.Loeb, Y.Berwald-Netter, and L.Okun, ''A miniature microelectrode array to monitor the bioelectric activity of cultured cells,'' Experimental cell research, vol.74, no.1, pp. 61--66, September 1972. [Online]: http://dx.doi.org/10.1016/0014-4827(72)90481-8
[^25]: R.A. Andersen, E.J. Hwang, and G.H. Mulliken, ''Cognitive neural prosthetics,'' Annual review of Psychology, vol.61, pp. 169--190, December 2010, pMID: 19575625. [Online]: http://dx.doi.org/10.1146/annurev.psych.093008.100503
[^26]: L.A. Jorgenson, W.T. Newsome, D.J. Anderson, C.I. Bargmann, E.N. Brown, K.Deisseroth, J.P. Donoghue, K.L. Hudson, G.S. Ling, P.R. MacLeish et al., ''The brain initiative: developing technology to catalyse neuroscience discovery,'' Philosophical Transactions of the Royal Society of London B: Biological Sciences, vol. 370, no. 1668, p. 20140164, 2015.
[^27]: E.D'Angelo, G.Danese, G.Florimbi, F.Leporati, A.Majani, S.Masoli, S.Solinas, and E.Torti, ''The human brain project: High performance computing for brain cells hw/sw simulation and understanding,'' in Proceedings of the Digital System Design Conference, August 2015, pp. 740--747. [Online]: http://dx.doi.org/10.1109/DSD.2015.80
[^28]: K.Famm, B.Litt, K.J. Tracey, E.S. Boyden, and M.Slaoui, ''Drug discovery: a jump-start for electroceuticals,'' Nature, vol. 496, no. 7444, pp. 159--161, April 2013. [Online]: http://dx.doi.org/10.1038/496159a
[^29]: K.Deisseroth, ''Optogenetics,'' Nature methods, vol.8, no.1, pp. 26--29, January 2011. [Online]: http://dx.doi.org/10.1038/nmeth.f.324
[^30]: M.Velliste, S.Perel, M.C. Spalding, A.S. Whitford, and A.B. Schwartz, ''Cortical control of a prosthetic arm for self-feeding,'' Nature, vol. 453, no. 7198, pp. 1098--1101, June 2008. [Online]: http://dx.doi.org/10.1038/nature06996
[^31]: T.N. Theis and P.M. Solomon, ''In quest of the "next switch" prospects for greatly reduced power dissipation in a successor to the silicon field-effect transistor,'' Proceedings of the IEEE, vol.98, no.12, pp. 2005--2014, December 2010. [Online]: http://dx.doi.org/10.1109/JPROC.2010.2066531
[^32]: G.M. Amdahl, ''Validity of the single processor approach to achieving large scale computing capabilities,'' in AFIPS Conference Proceedings, vol. 30 (Atlantic City, N.J., Apr. 18--20), 1967, pp. 483--485; reprinted by IEEE, vol.12, no.3, pp. 19--20, Summer 2007. [Online]: http://dx.doi.org/10.1109/N-SSC.2007.4785615
[^33]: J.G. Koller and W.C. Athas, ''Adiabatic switching, low energy computing, and the physics of storing and erasing information,'' in IEEE Proceedings of the Workshop on Physics and Computation. IEEE, October 1992, pp. 267--270. [Online]: http://dx.doi.org/10.1109/PHYCMP.1992.615554
[^34]: E.P. DeBenedictis, J.E. Cook, M.F. Hoemmen, and T.S. Metodi, ''Optimal adiabatic scaling and the processor-in-memory-and-storage architecture (oas :pims),'' in IEEE Proceedings of the International Symposium on Nanoscale Architectures. IEEE, July 2015, pp. 69--74. [Online]: http://dx.doi.org/10.1109/NANOARCH.2015.7180589
[^35]: S.Houri, G.Billiot, M.Belleville, A.Valentian, and H.Fanet, ''Limits of cmos technology and interest of nems relays for adiabatic logic applications,'' IEEE Transactions on Circuits and Systems---Part I: Fundamental Theory and Applications, vol.62, no.6, pp. 1546--1554, June 2015. [Online]: http://dx.doi.org/10.1109/TCSI.2015.2415177
[^36]: S.K. Arfin and R.Sarpeshkar, ''An energy-efficient, adiabatic electrode stimulator with inductive energy recycling and feedback current regulation,'' IEEE Transactions on Biomedical Circuits and Systems, vol.6, no.1, pp. 1--14, February 2012. [Online]: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6036003&isnumber=6138606
[^37]: P.R. Kinget, ''Scaling analog circuits into deep nanoscale cmos: Obstacles and ways to overcome them,'' in IEEE Proceedings of the Custom Integrated Circuits Conference. IEEE, September 2015, pp. 1--8. [Online]: http://dx.doi.org/10.1109/CICC.2015.7338394
[^38]: K.Bernstein, D.J. Frank, A.E. Gattiker, W.Haensch, B.L. Ji, S.R. Nassif, E.J. Nowak, D.J. Pearson, and N.J. Rohrer, ''High-performance cmos variability in the 65-nm regime and beyond,'' IBM Journal of Research and Development, vol.50, no. 4.5, pp. 433--449, July 2006. [Online]: http://dx.doi.org/10.1147/rd.504.0433
[^39]: L.L. Lewyn, T.Ytterdal, C.Wulff, and K.Martin, ''Analog circuit design in nanoscale cmos technologies,'' Proceedings of the IEEE, vol.97, no.10, pp. 1687--1714, October 2009. [Online]: http://dx.doi.org/10.1109/JPROC.2009.2024663
[^40]: Y.Xin, W.X.Y. Li, Z.Zhang, R.C.C. Cheung, D.Song, and T.W. Berger, ''An application specific instruction set processor (asip) for adaptive filters in neural prosthetics,'' IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol.12, no.5, pp. 1034--1047, September 2015. [Online]: http://dx.doi.org/10.1109/TCBB.2015.2440248
[^41]: G.Schalk, P.Brunner, L.A. Gerhardt, H.Bischof, and J.R. Wolpaw, ''Brain-computer interfaces (bcis): detection instead of classification,'' Journal of neuroscience methods, vol. 167, no.1, pp. 51--62, 2008, brain-Computer Interfaces (BCIs). [Online]: http://www.sciencedirect.com/science/article/pii/S0165027007004116
[^42]: Z.Li, J.E. O'Doherty, T.L. Hanson, M.A. Lebedev, C.S. Henriquez, and M.A. Nicolelis, ''Unscented kalman filter for brain-machine interfaces,'' PloS one, vol.4, no.7, pp. 1--18, 2009. [Online]: http://dx.doi.org/10.1371/journal.pone.0006243
[^43]: A.L. Orsborn, H.G. Moorman, S.A. Overduin, M.M. Shanechi, D.F. Dimitrov, and J.M. Carmena, ''Closed-loop decoder adaptation shapes neural plasticity for skillful neuroprosthetic control,'' Neuron, vol.82, pp. 1380 -- 1393, March 2016. [Online]: http://dx.doi.org/10.1016/j.neuron.2014.04.048
[^44]: Y.Yan, X.Qin, Y.Wu, N.Zhang, J.Fan, and L.Wang, ''A restricted boltzmann machine based two-lead electrocardiography classification,'' in IEEE Proceedings of the International Conference on Wearable and Implantable Body Sensor Networks. IEEE, June 2015, pp. 1--9. [Online]: http://dx.doi.org/10.1109/BSN.2015.7299399
[^45]: B.M. Yu and J.P. Cunningham, ''Dimensionality reduction for large-scale neural recordings,'' Nature Neuroscience, vol.17, pp. 1500 -- 1509, November 2014. [Online]: http://dx.doi.org/10.1038/nn.3776
[^46]: S.Makeig, C.Kothe, T.Mullen, N.Bigdely-Shamlo, Z.Zhang, and K.Kreutz-Delgado, ''Evolving signal processing for brain: Computer interfaces,'' Proceedings of the IEEE, vol. 100, no. Special Centennial Issue, pp. 1567--1584, May 2012. [Online]: http://dx.doi.org/10.1109/JPROC.2012.2185009
[^47]: G.Indiveri and S.C. Liu, ''Memory and information processing in neuromorphic systems,'' Proceedings of the IEEE, vol. 103, no.8, pp. 1379--1397, August 2015. [Online]: http://dx.doi.org/10.1109/JPROC.2015.2444094
[^48]: Y.Chen, E.Yao, and A.Basu, ''A 128-channel extreme learning machine-based neural decoder for brain machine interfaces,'' IEEE Transactions on Biomedical Circuits and Systems, vol.10, no.3, pp. 679--692, June 2016. [Online]: http://dx.doi.org/10.1109/TBCAS.2015.2483618
[^49]: V.Karkare, S.Gibson, and D.Marković, ''A 75- $\mu$w, 16-channel neural spike-sorting processor with unsupervised clustering,'' IEEE Journal of Solid-State Circuits, vol.48, no.9, pp. 2230--2238, September 2013. [Online]: http://dx.doi.org/10.1109/JSSC.2013.2264616
[^50]: T.C. Chen, W.Liu, and L.G. Chen, ''128-channel spike sorting processor with a parallel-folding structure in 90nm process,'' in IEEE Proceedings of the International Symposium on Circuits and Systems, May 2009, pp. 1253--1256. [Online]: http://dx.doi.org/10.1109/ISCAS.2009.5117990
[^51]: G.Baranauskas, ''What limits the performance of current invasive brain machine interfaces?'' Frontiers in Systems Neuroscience, vol.8, no.68, April 2014. [Online]: http://www.frontiersin.org/systems_neuroscience/10.3389/fnsys.2014.00068
[^52]: E.F. Chang, ''Towards large-scale, human-based, mesoscopic neurotechnologies,'' Neuron, vol.86, pp. 68--78, March 2016. [Online]: http://dx.doi.org/10.1016/j.neuron.2015.03.037
[^53]: M.A.L. Nicolelis and M.A. Lebedev, ''Principles of neural ensemble physiology underlying the operation of brain-machine,'' Nature Reviews Neuroscience, vol.10, pp. 530--540, July 2009. [Online]: http://dx.doi.org/10.1038/nrn2653
[^54]: Z.Fekete, ''Recent advances in silicon-based neural microelectrodes and microsystems: a review,'' Sensors and Actuators B: Chemical, vol. 215, pp. 300--315, 2015. [Online]: http://www.sciencedirect.com/science/article/pii/S092540051500386X
[^55]: N.Saeidi, M.Schuettler, A.Demosthenous, and N.Donaldson, ''Technology for integrated circuit micropackages for neural interfaces, based on goldsilicon wafer bonding,'' Journal of Micromechanics AND Microengineering, vol.23, no.7, p. 075021, June 2013. [Online]: http://stacks.iop.org/0960-1317/23/i=7/a=075021
[^56]: K.Seidl, S.Herwik, T.Torfs, H.P. Neves, O.Paul, and P.Ruther, ''Cmos-based high-density silicon microprobe arrays for electronic depth control in intracortical neural recording,'' IEEE Journal of Microelectromechanical Systems, vol.20, no.6, pp. 1439--1448, December 2011. [Online]: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6033040&isnumber=6075219
[^57]: T.D.Y. Kozai, N.B. Langhals, P.R. Patel, X.Deng, H.Zhang, K.L. Smith, J.Lahann, N.A. Kotov, and D.R. Kipke, ''Ultrasmall implantable composite microelectrodes with bioactive surfaces for chronic neural interfaces,'' Nature Materials, vol.11, pp. 1065--1073, December 2012. [Online]: http://dx.doi.org/10.1038/nmat3468
[^58]: D.A. Schwarz, M.A. Lebedev, T.L. Hanson, D.F. Dimitrov, G.Lehew, J.Meloy, S.Rajangam, V.Subramanian, P.J. Ifft, Z.Li, A.Ramakrishnan, A.Tate, K.Z. Zhuang, and M.A.L. Nicolelis, ''Chronic, wireless recordings of large-scale brain activity in freely moving rhesus monkeys,'' Nature Methods, vol.11, pp. 670--676, April 2014. [Online]: http://dx.doi.org/10.1038/nmeth.2936
[^59]: P.Ruther, S.Herwik, S.Kisban, K.Seidl, and O.Paul, ''Recent progress in neural probes using silicon mems technology,'' IEEJ Transactions on Electrical and Electronic Engineering, vol.5, no.5, pp. 505--515, 2010. [Online]: http://dx.doi.org/10.1002/tee.20566
[^60]: H.-W. Kang, S.J. Lee, I.K. Ko, C.Kengla, J.J. Yoo, and A.Atala, ''A 3d bioprinting system to produce human-scale tissue constructs with structural integrity,'' Nature Biotechnology, vol.34, pp. 312--319, March 2016. [Online]: http://dx.doi.org/10.1038/nbt.3413
[^61]: C.Xie, J.Liu, T.-M. Fu, X.Dai, W.Zhou, and C.M. Lieber, ''Three-dimensional macroporous nanoelectronic networks as minimally invasive brain probes,'' Nature Materials, vol.14, pp. 1286--1292, May 2015. [Online]: http://dx.doi.org/10.1038/nmat4427
[^62]: R.R. Harrison, P.T. Watkins, R.J. Kier, R.O. Lovejoy, D.J. Black, B.Greger, and F.Solzbacher, ''A low-power integrated circuit for a wireless 100-electrode neural recording system,'' IEEE Journal of Solid-State Circuits, vol.42, no.1, pp. 123--133, Jan 2007. [Online]: http://dx.doi.org/10.1109/JSSC.2006.886567
[^63]: J.Guo, W.Ng, J.Yuan, S.Li, and M.Chan, ''A 200-channel area-power-efficient chemical and electrical dual-mode acquisition ic for the study of neurodegenerative diseases,'' IEEE Transactions on Biomedical Circuits and Systems, vol.10, no.3, pp. 567--578, June 2016. [Online]: http://dx.doi.org/10.1109/TBCAS.2015.2468052
[^64]: W.Biederman, D.J. Yeager, N.Narevsky, J.Leverett, R.Neely, J.M. Carmena, E.Alon, and J.M. Rabaey, ''A 4.78 mm 2 fully-integrated neuromodulation soc combining 64 acquisition channels with digital compression and simultaneous dual stimulation,'' IEEE Journal of Solid-State Circuits, vol.50, no.4, pp. 1038--1047, April 2015. [Online]: http://dx.doi.org/10.1109/JSSC.2014.2384736
[^65]: R.Muller, S.Gambini, and J.M. Rabaey, ''A 0.013mm$^2$, $5 \mu w$, dc-coupled neural signal acquisition ic with 0.5v supply,'' IEEE Journal of Solid-State Circuits, vol.47, no.1, pp. 232--243, Jan 2012. [Online]: http://dx.doi.org/10.1109/JSSC.2011.2163552
[^66]: H.Kassiri, A.Bagheri, N.Soltani, K.Abdelhalim, H.M. Jafari, M.T. Salam, J.L.P. Velazquez, and R.Genov, ''Battery-less tri-band-radio neuro-monitor and responsive neurostimulator for diagnostics and treatment of neurological disorders,'' IEEE Journal of Solid-State Circuits, vol.51, no.5, pp. 1274--1289, May 2016. [Online]: http://dx.doi.org/10.1109/JSSC.2016.2528999
[^67]: M.Ballini, J.Müller, P.Livi, Y.Chen, U.Frey, A.Stettler, A.Shadmani, V.Viswam, I.L. Jones, D.Jäckel, M.Radivojevic, M.K. Lewandowska, W.Gong, M.Fiscella, D.J. Bakkum, F.Heer, and A.Hierlemann, ''A 1024-channel cmos microelectrode array with 26,400 electrodes for recording and stimulation of electrogenic cells in vitro,'' IEEE Journal of Solid-State Circuits, vol.49, no.11, pp. 2705--2719, Nov 2014. [Online]: http://dx.doi.org/10.1109/JSSC.2014.2359219
[^68]: P.D. Wolf, Thermal considerations for the design of an implanted cortical brain--machine interface (BMI). CRC Press, Boca Raton, FL, 2008, pMID: 21204402. [Online]: http://www.ncbi.nlm.nih.gov/books/NBK3932
[^69]: T.Denison, K.Consoer, W.Santa, A.T. Avestruz, J.Cooley, and A.Kelly, ''A 2 $\mu$w 100 nv/rthz chopper-stabilized instrumentation amplifier for chronic measurement of neural field potentials,'' IEEE Journal of Solid-State Circuits, vol.42, no.12, pp. 2934--2945, December 2007. [Online]: http://dx.doi.org/10.1109/JSSC.2007.908664
[^70]: B.Johnson, S.T. Peace, A.Wang, T.A. Cleland, and A.Molnar, ''A 768-channel cmos microelectrode array with angle sensitive pixels for neuronal recording,'' IEEE Sensors Journal, vol.13, no.9, pp. 3211--3218, Sept 2013. [Online]: http://dx.doi.org/10.1109/JSEN.2013.2266894
[^71]: C.M. Lopez, A.Andrei, S.Mitra, M.Welkenhuysen, W.Eberle, C.Bartic, R.Puers, R.F. Yazicioglu, and G.G.E. Gielen, ''An implantable 455-active-electrode 52-channel cmos neural probe,'' IEEE Journal of Solid-State Circuits, vol.49, no.1, pp. 248--261, January 2014. [Online]: http://dx.doi.org/10.1109/JSSC.2013.2284347
[^72]: J.Scholvin, J.P. Kinney, J.G. Bernstein, C.Moore-Kochlacs, N.Kopell, C.G. Fonstad, and E.S. Boyden, ''Close-packed silicon microelectrodes for scalable spatially oversampled neural recording,'' IEEE Transactions on Biomedical Engineering, vol.63, no.1, pp. 120--130, Jan 2016. [Online]: http://dx.doi.org/10.1109/TBME.2015.2406113
[^73]: M.Han, B.Kim, Y.A. Chen, H.Lee, S.H. Park, E.Cheong, J.Hong, G.Han, and Y.Chae, ''Bulk switching instrumentation amplifier for a high-impedance source in neural signal recording,'' IEEE Transactions on Circuits and Systems---Part II: Express Briefs, vol.62, no.2, pp. 194--198, Feb 2015. [Online]: http://dx.doi.org/10.1109/TCSII.2014.2368615
[^74]: R.Muller, S.Gambini, and J.M. Rabaey, ''A 0.013$ $mm$^2$, 5$ \mu$w, dc-coupled neural signal acquisition ic with 0.5 v supply,'' IEEE Journal of Solid-State Circuits, vol.47, no.1, pp. 232--243, Jan 2012. [Online]: http://dx.doi.org/10.1109/JSSC.2011.2163552
[^75]: ''Rhd2164 digital electrophysiology interface chip - data sheet,'' Intan Technologies, Los Angeles, California, December 2013. [Online]: http://www.intantech.com/files/Intan_RHD2164_datasheet.pdf
[^76]: K.M. Al-Ashmouny, S.I. Chang, and E.Yoon, ''A 4 $\mu$w/ch analog front-end module with moderate inversion and power-scalable sampling operation for 3-d neural microsystems,'' IEEE Transactions on Biomedical Circuits and Systems, vol.6, no.5, pp. 403--413, October 2012. [Online]: http://dx.doi.org/10.1109/TBCAS.2012.2218105
[^77]: D.Han, Y.Zheng, R.Rajkumar, G.S. Dawe, and M.Je, ''A 0.45 v 100-channel neural-recording ic with sub-$\mu$w/channel consumption in 0.18$\mu$m cmos,'' IEEE Transactions on Biomedical Circuits and Systems, vol.7, no.6, pp. 735--746, December 2013. [Online]: http://dx.doi.org/10.1109/TBCAS.2014.2298860
[^78]: S.B. Lee, H.M. Lee, M.Kiani, U.M. Jow, and M.Ghovanloo, ''An inductively powered scalable 32-channel wireless neural recording system-on-a-chip for neuroscience applications,'' IEEE Transactions on Biomedical Circuits and Systems, vol.4, no.6, pp. 360--371, Dec 2010. [Online]: http://dx.doi.org/10.1109/TBCAS.2010.2078814
[^79]: J.Yoo, L.Yan, D.El-Damak, M.A.B. Altaf, A.H. Shoeb, and A.P. Chandrakasan, ''An 8-channel scalable eeg acquisition soc with patient-specific seizure classification and recording processor,'' IEEE Journal of Solid-State Circuits, vol.48, no.1, pp. 214--228, Jan 2013. [Online]: http://dx.doi.org/10.1109/JSSC.2012.2221220
[^80]: M.A.B. Altaf and J.Yoo, ''A 1.83$ \mu$j/classification, 8-channel, patient-specific epileptic seizure classification soc using a non-linear support vector machine,'' IEEE Transactions on Biomedical Circuits and Systems, vol.10, no.1, pp. 49--60, Feb 2016. [Online]: http://dx.doi.org/10.1109/TBCAS.2014.2386891
[^81]: K.Abdelhalim, H.M. Jafari, L.Kokarovtseva, J.L.P. Velazquez, and R.Genov, ''64-channel uwb wireless neural vector analyzer soc with a closed-loop phase synchrony-triggered neurostimulator,'' IEEE Journal of Solid-State Circuits, vol.48, no.10, pp. 2494--2510, Oct 2013. [Online]: http://dx.doi.org/10.1109/JSSC.2013.2272952
[^82]: A.Bagheri, S.R.I. Gabran, M.T. Salam, J.L.P. Velazquez, R.R. Mansour, M.M.A. Salama, and R.Genov, ''Massively-parallel neuromonitoring and neurostimulation rodent headset with nanotextured flexible microelectrodes,'' IEEE Transactions on Biomedical Circuits and Systems, vol.7, no.5, pp. 601--609, Oct 2013. [Online]: http://dx.doi.org/10.1109/TBCAS.2013.2281772
[^83]: H.G. Rhew, J.Jeong, J.A. Fredenburg, S.Dodani, P.G. Patil, and M.P. Flynn, ''A fully self-contained logarithmic closed-loop deep brain stimulation soc with wireless telemetry and wireless power management,'' IEEE Journal of Solid-State Circuits, vol.49, no.10, pp. 2213--2227, Oct 2014. [Online]: http://dx.doi.org/10.1109/JSSC.2014.2346779
[^84]: W.Biederman, D.J. Yeager, N.Narevsky, J.Leverett, R.Neely, J.M. Carmena, E.Alon, and J.M. Rabaey, ''A 4.78 mm 2 fully-integrated neuromodulation soc combining 64 acquisition channels with digital compression and simultaneous dual stimulation,'' IEEE Journal of Solid-State Circuits, vol.50, no.4, pp. 1038--1047, April 2015. [Online]: http://dx.doi.org/10.1109/JSSC.2014.2384736
[^85]: A.Mendez, A.Belghith, and M.Sawan, ''A dsp for sensing the bladder volume through afferent neural pathways,'' IEEE Transactions on Biomedical Circuits and Systems, vol.8, no.4, pp. 552--564, Aug 2014. [Online]: http://dx.doi.org/10.1109/TBCAS.2013.2282087
[^86]: T.T. Liu and J.M. Rabaey, ''A 0.25 v 460 nw asynchronous neural signal processor with inherent leakage suppression,'' IEEE Journal of Solid-State Circuits, vol.48, no.4, pp. 897--906, April 2013. [Online]: http://dx.doi.org/10.1109/JSSC.2013.2239096
[^87]: D.Han, Y.Zheng, R.Rajkumar, G.S. Dawe, and M.Je, ''A 0.45 v 100-channel neural-recording ic with sub-$\mu$w/channel consumption in 0.18$ \mu$m cmos,'' IEEE Transactions on Biomedical Circuits and Systems, vol.7, no.6, pp. 735--746, Dec 2013. [Online]: http://dx.doi.org/10.1109/TBCAS.2014.2298860
[^88]: R.Muller, H.P. Le, W.Li, P.Ledochowitsch, S.Gambini, T.Bjorninen, A.Koralek, J.M. Carmena, M.M. Maharbiz, E.Alon, and J.M. Rabaey, ''A minimally invasive 64-channel wireless $\mu$ecog implant,'' IEEE Journal of Solid-State Circuits, vol.50, no.1, pp. 344--359, Jan 2015. [Online]: http://dx.doi.org/10.1109/JSSC.2014.2364824
[^89]: B.Vigraham, J.Kuppambatti, and P.R. Kinget, ''Switched-mode operational amplifiers and their application to continuous-time filters in nanoscale cmos,'' IEEE Journal of Solid-State Circuits, vol.49, no.12, pp. 2758--2772, December 2014. [Online]: http://dx.doi.org/10.1109/JSSC.2014.2354641
[^90]: V.Karkare, H.Chandrakumar, D.Rozgić, and D.Marković, ''Robust, reconfigurable, and power-efficient biosignal recording systems,'' in IEEE Proceedings of the Custom Integrated Circuits Conference, Sept 2014, pp. 1--8. [Online]: http://dx.doi.org/10.1109/CICC.2014.6946018
[^91]: L.B. Leene and T.G. Constandinou, ''A 0.45v continuous time-domain filter using asynchronous oscillator structures,'' in IEEE Proceedings of the International Conference on Electronics, Circuits and Systems, December 2016.
[^92]: R.Mohan, L.Yan, G.Gielen, C.V. Hoof, and R.F. Yazicioglu, ''0.35 v time-domain-based instrumentation amplifier,'' Electronics Letters, vol.50, no.21, pp. 1513--1514, October 2014. [Online]: http://dx.doi.org/10.1049/el.2014.2471
[^93]: X.Zhang, Z.Zhang, Y.Li, C.Liu, Y.X. Guo, and Y.Lian, ''A 2.89$ \mu$w dry-electrode enabled clockless wireless ecg soc for wearable applications,'' IEEE Journal of Solid-State Circuits, vol.51, no.10, pp. 2287--2298, Oct 2016. [Online]: http://dx.doi.org/10.1109/JSSC.2016.2582863
[^94]: M.Elia, L.B. Leene, and T.G. Constandinou, ''Continuous-time micropower interface for neural recording applications,'' in IEEE Proceedings of the International Symposium on Circuits and Systems, May 2016, pp. 534--537. [Online]: http://dx.doi.org/10.1109/ISCAS.2016.7527295
[^95]: N.Guo, Y.Huang, T.Mai, S.Patil, C.Cao, M.Seok, S.Sethumadhavan, and Y.Tsividis, ''Energy-efficient hybrid analog/digital approximate computation in continuous time,'' IEEE Journal of Solid-State Circuits, vol.51, no.7, pp. 1514--1524, July 2016. [Online]: http://dx.doi.org/10.1109/JSSC.2016.2543729
[^96]: B.Bozorgzadeh, D.R. Schuweiler, M.J. Bobak, P.A. Garris, and P.Mohseni, ''Neurochemostat: A neural interface soc with integrated chemometrics for closed-loop regulation of brain dopamine,'' IEEE Transactions on Biomedical Circuits and Systems, vol.10, no.3, pp. 654--667, June 2016. [Online]: http://dx.doi.org/10.1109/TBCAS.2015.2453791
[^97]: E.B. Myers and M.L. Roukes, ''Comparative advantages of mechanical biosensors,'' Nature nanotechnology, vol.6, no.4, pp. 1748--3387, April 2011. [Online]: http://dx.doi.org/10.1038/nnano.2011.44
[^98]: R.Machado, N.Soltani, S.Dufour, M.T. Salam, P.L. Carlen, R.Genov, and M.Thompson, ''Biofouling-resistant impedimetric sensor for array high-resolution extracellular potassium monitoring in the brain,'' Biosensors, vol.6, no.4, p.53, October 2016. [Online]: http://dx.doi.org/10.3390/bios6040053
[^99]: J.Guo, W.Ng, J.Yuan, S.Li, and M.Chan, ''A 200-channel area-power-efficient chemical and electrical dual-mode acquisition ic for the study of neurodegenerative diseases,'' IEEE Transactions on Biomedical Circuits and Systems, vol.10, no.3, pp. 567--578, June 2016. [Online]: http://dx.doi.org/10.1109/TBCAS.2015.2468052
[^100]: D.A. Dombeck, A.N. Khabbaz, F.Collman, T.L. Adelman, and D.W. Tank, ''Imaging large-scale neural activity with cellular resolution in awake, mobile mice.'' Neuron, vol.56, no.1, pp. 43--57, October 2007. [Online]: http://dx.doi.org/10.1016/j.neuron.2007.08.003
[^101]: T.York, S.B. Powell, S.Gao, L.Kahan, T.Charanya, D.Saha, N.W. Roberts, T.W. Cronin, J.Marshall, S.Achilefu, S.P. Lake, B.Raman, and V.Gruev, ''Bioinspired polarization imaging sensors: From circuits and optics to signal processing algorithms and biomedical applications,'' Proceedings of the IEEE, vol. 102, no.10, pp. 1450--1469, Oct 2014. [Online]: http://dx.doi.org/10.1109/JPROC.2014.2342537
[^102]: K.Paralikar, P.Cong, O.Yizhar, L.E. Fenno, W.Santa, C.Nielsen, D.Dinsmoor, B.Hocken, G.O. Munns, J.Giftakis, K.Deisseroth, and T.Denison, ''An implantable optical stimulation delivery system for actuating an excitable biosubstrate,'' IEEE Journal of Solid-State Circuits, vol.46, no.1, pp. 321--332, Jan 2011. [Online]: http://dx.doi.org/10.1109/JSSC.2010.2074110
[^103]: N.Ji and S.L. Smith, ''Technologies for imaging neural activity in large volumes,'' Nature Neuroscience, vol.19, pp. 1154--1164, September 2016. [Online]: http://dx.doi.org/10.1038/nn.4358
[^104]: S.Song, K.D. Miller, and L.F. Abbott, ''Competitive hebbian learning through spike-timing-dependent synaptic plasticity,'' Nature Neuroscience, vol.3, pp. 919--926, September 2000. [Online]: http://dx.doi.org/10.1038/78829
[^105]: T.Kurafuji, M.Haraguchi, M.Nakajima, T.Nishijima, T.Tanizaki, H.Yamasaki, T.Sugimura, Y.Imai, M.Ishizaki, T.Kumaki, K.Murata, K.Yoshida, E.Shimomura, H.Noda, Y.Okuno, S.Kamijo, T.Koide, H.J. Mattausch, and K.Arimoto, ''A scalable massively parallel processor for real-time image processing,'' IEEE Journal of Solid-State Circuits, vol.46, no.10, pp. 2363--2373, October 2011. [Online]: http://dx.doi.org/10.1109/JSSC.2011.2159528
[^106]: J.Y. Kim, M.Kim, S.Lee, J.Oh, K.Kim, and H.J. Yoo, ''A 201.4 gops 496 mw real-time multi-object recognition processor with bio-inspired neural perception engine,'' IEEE Journal of Solid-State Circuits, vol.45, no.1, pp. 32--45, Jan 2010. [Online]: http://dx.doi.org/10.1109/JSSC.2009.2031768
[^107]: C.C. Cheng, C.H. Lin, C.T. Li, and L.G. Chen, ''ivisual: An intelligent visual sensor soc with 2790 fps cmos image sensor and 205 gops/w vision processor,'' IEEE Journal of Solid-State Circuits, vol.44, no.1, pp. 127--135, Jan 2009. [Online]: http://dx.doi.org/10.1109/JSSC.2008.2007158
[^108]: H.Noda, M.Nakajima, K.Dosaka, K.Nakata, M.Higashida, O.Yamamoto, K.Mizumoto, T.Tanizaki, T.Gyohten, Y.Okuno, H.Kondo, Y.Shimazu, K.Arimoto, K.Saito, and T.Shimizu, ''The design and implementation of the massively parallel processor based on the matrix architecture,'' IEEE Journal of Solid-State Circuits, vol.42, no.1, pp. 183--192, Jan 2007. [Online]: http://dx.doi.org/10.1109/JSSC.2006.886545
[^109]: M.S. Chae, W.Liu, and M.Sivaprakasam, ''Design optimization for integrated neural recording systems,'' IEEE Journal of Solid-State Circuits, vol.43, no.9, pp. 1931--1939, September 2008. [Online]: http://dx.doi.org/10.1109/JSSC.2008.2001877
[^110]: K.J. Miller, L.B. Sorensen, J.G. Ojemann, and M.den Nijs, ''Power-law scaling in the brain surface electric potential,'' PLoS Comput Biol, vol.5, no.12, pp. 1--10, December 2009. [Online]: http://dx.doi.org/10.1371/journal.pcbi.1000609
[^111]: R.Harrison and C.Charles, ''A low-power low-noise cmos amplifier for neural recording applications,'' IEEE Journal of Solid-State Circuits, vol.38, no.6, pp. 958--965, June 2003. [Online]: http://dx.doi.org/10.1109/JSSC.2003.811979
[^112]: W.Sansen, ''1.3 analog cmos from 5 micrometer to 5 nanometer,'' in IEEE Proceedings of the International Solid-State Circuits Conference, February 2015, pp. 1--6. [Online]: http://dx.doi.org/10.1109/ISSCC.2015.7062848
[^113]: M.S.J. Steyaert and W.M.C. Sansen, ''A micropower low-noise monolithic instrumentation amplifier for medical purposes,'' IEEE Journal of Solid-State Circuits, vol.22, no.6, pp. 1163--1168, December 1987. [Online]: http://dx.doi.org/10.1109/JSSC.1987.1052869
[^114]: W.Wattanapanitch, M.Fee, and R.Sarpeshkar, ''An energy-efficient micropower neural recording amplifier,'' IEEE Transactions on Biomedical Circuits and Systems, vol.1, no.2, pp. 136--147, June 2007. [Online]: http://dx.doi.org/10.1109/TBCAS.2007.907868
[^115]: B.Johnson and A.Molnar, ''An orthogonal current-reuse amplifier for multi-channel sensing,'' IEEE Journal of Solid-State Circuits, vol.48, no.6, pp. 1487--1496, June 2013. [Online]: http://dx.doi.org/10.1109/JSSC.2013.2257478
[^116]: C.Qian, J.Parramon, and E.Sanchez-Sinencio, ''A micropower low-noise neural recording front-end circuit for epileptic seizure detection,'' IEEE Journal of Solid-State Circuits, vol.46, no.6, pp. 1392--1405, June 2011. [Online]: http://dx.doi.org/10.1109/JSSC.2011.2126370
[^117]: X.Zou, L.Liu, J.H. Cheong, L.Yao, P.Li, M.-Y. Cheng, W.L. Goh, R.Rajkumar, G.Dawe, K.-W. Cheng, and M.Je, ''A 100-channel 1-mw implantable neural recording ic,'' IEEE Transactions on Circuits and Systems---Part I: Regular Papers, vol.60, no.10, pp. 2584--2596, October 2013. [Online]: http://dx.doi.org/10.1109/TCSI.2013.2249175
[^118]: V.Majidzadeh, A.Schmid, and Y.Leblebici, ''Energy efficient low-noise neural recording amplifier with enhanced noise efficiency factor,'' IEEE Transactions on Biomedical Circuits and Systems, vol.5, no.3, pp. 262--271, June 2011. [Online]: http://dx.doi.org/10.1109/TBCAS.2010.2078815
[^119]: C.C. Enz and E.A. Vittoz, Charge-based MOS transistor modeling: the EKV model for low-power and RF IC design. John Wiley & Sons, August 2006. [Online]: http://eu.wiley.com/WileyCDA/WileyTitle/productCd-0470855452.html
[^120]: Y.Yasuda, T.-J.K. Liu, and C.Hu, ''Flicker-noise impact on scaling of mixed-signal cmos with hfsion,'' IEEE Transactions on Electron Devices, vol.55, no.1, pp. 417--422, January 2008. [Online]: http://dx.doi.org/10.1109/TED.2007.910759
[^121]: S.-Y. Wu, C.Lin, M.Chiang, J.Liaw, J.Cheng, S.Yang, M.Liang, T.Miyashita, C.Tsai, B.Hsu, H.Chen, T.Yamamoto, S.Chang, V.Chang, C.Chang, J.Chen, H.Chen, K.Ting, Y.Wu, K.Pan, R.Tsui, C.Yao, P.Chang, H.Lien, T.Lee, H.Lee, W.Chang, T.Chang, R.Chen, M.Yeh, C.Chen, Y.Chiu, Y.Chen, H.Huang, Y.Lu, C.Chang, M.Tsai, C.Liu, K.Chen, C.Kuo, H.Lin, S.Jang, and Y.Ku, ''A 16nm finfet cmos technology for mobile soc and computing applications,'' in IEEE Proceedings of the International Electron Devices Meeting, December 2013, pp. 9.1.1--9.1.4. [Online]: http://dx.doi.org/10.1109/IEDM.2013.6724591
[^122]: L.B. Leene, Y.Liu, and T.G. Constandinou, ''A compact recording array for neural interfaces,'' in IEEE Proceedings of the Biomedical Circuits and Systems Conference, October 2013, pp. 97--100. [Online]: http://dx.doi.org/10.1109/BioCAS.2013.6679648
[^123]: Q.Fan, F.Sebastiano, J.Huijsing, and K.Makinwa, ''A 1.8 $\mu$w 60 nv/$\sqrt{Hz}$ capacitively-coupled chopper instrumentation amplifier in 65 nm cmos for wireless sensor nodes,'' IEEE Journal of Solid-State Circuits, vol.46, no.7, pp. 1534--1543, July 2011. [Online]: http://dx.doi.org/10.1109/JSSC.2011.2143610
[^124]: H.Chandrakumar and D.Markovic, ''A simple area-efficient ripple-rejection technique for chopped biosignal amplifiers,'' IEEE Transactions on Circuits and Systems---Part II: Express Briefs, vol.62, no.2, pp. 189--193, February 2015. [Online]: http://dx.doi.org/10.1109/TCSII.2014.2387686
[^125]: H.Chandrakumar and D.Markovic, ''A 2$\mu$w 40mvpp linear-input-range chopper-stabilized bio-signal amplifier with boosted input impedance of 300 M$\Omega$ and electrode-offset filtering,'' in IEEE Proceedings of the International Solid-State Circuits Conference, January 2016, pp. 96--97. [Online]: http://dx.doi.org/10.1109/ISSCC.2016.7417924
[^126]: H.Rezaee-Dehsorkh, N.Ravanshad, R.Lotfi, K.Mafinezhad, and A.M. Sodagar, ''Analysis and design of tunable amplifiers for implantable neural recording applications,'' IEEE Transactions on Emerging and Selected Topics in Circuits and Systems, vol.1, no.4, pp. 546--556, December 2011. [Online]: http://dx.doi.org/10.1109/JETCAS.2011.2174492
[^127]: X.Zou, X.Xu, L.Yao, and Y.Lian, ''A 1-v 450-nw fully integrated programmable biomedical sensor interface chip,'' IEEE Journal of Solid-State Circuits, vol.44, no.4, pp. 1067--1077, April 2009. [Online]: http://dx.doi.org/10.1109/JSSC.2009.2014707
[^128]: L.Leene and T.Constandinou, ''Ultra-low power design strategy for two-stage amplifier topologies,'' Electronics Letters, vol.50, no.8, pp. 583--585, April 2014. [Online]: http://dx.doi.org/10.1049/el.2013.4196
[^129]: H.G. Rey, C.Pedreira, and R.Q. Quiroga, ''Past, present and future of spike sorting techniques,'' Brain Research Bulletin, vol. 119, Part B, pp. 106--117, October 2015, advances in electrophysiological data analysis. [Online]: http://www.sciencedirect.com/science/article/pii/S0361923015000684
[^130]: Y.Chen, A.Basu, L.Liu, X.Zou, R.Rajkumar, G.S. Dawe, and M.Je, ''A digitally assisted, signal folding neural recording amplifier,'' IEEE Transactions on Biomedical Circuits and Systems, vol.8, no.4, pp. 528--542, August 2014. [Online]: http://dx.doi.org/10.1109/TBCAS.2013.2288680
[^131]: X.Yue, ''Determining the reliable minimum unit capacitance for the dac capacitor array of sar adcs,'' Microelectronics Journal, vol.44, no.6, pp. 473--478, 2013. [Online]: http://www.sciencedirect.com/science/article/pii/S0026269213000815
[^132]: Y.Zhu, C.-H. Chan, U.-F. Chio, S.-W. Sin, S.-P. U, R.Martins, and F.Maloberti, ''Split-sar adcs: Improved linearity with power and speed optimization,'' IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol.22, no.2, pp. 372--383, February 2014. [Online]: http://dx.doi.org/10.1109/TVLSI.2013.2242501
[^133]: L.Xie, G.Wen, J.Liu, and Y.Wang, ''Energy-efficient hybrid capacitor switching scheme for sar adc,'' Electronics Letters, vol.50, no.1, pp. 22--23, January 2014. [Online]: http://dx.doi.org/10.1049/el.2013.2794
[^134]: P.Nuzzo, F.DeBernardinis, P.Terreni, and G.Vander Plas, ''Noise analysis of regenerative comparators for reconfigurable adc architectures,'' IEEE Transactions on Circuits and Systems---Part I: Regular Papers, vol.55, no.6, pp. 1441--1454, July 2008. [Online]: http://dx.doi.org/10.1109/TCSI.2008.917991
[^135]: G.Heinzel, A.Rüdiger, and R.Schilling, ''Spectrum and spectral density estimation by the discrete fourier transform (dft), including a comprehensive list of window functions and some new flat-top windows,'' pp. 25--27, February 2002. [Online]: http://hdl.handle.net/11858/00-001M-0000-0013-557A-5
[^136]: F.Gerfers, M.Ortmanns, and Y.Manoli, ''A 1.5-v 12-bit power-efficient continuous-time third-order $\Sigma\Delta$ modulator,'' IEEE Journal of Solid-State Circuits, vol.38, no.8, pp. 1343--1352, Aug 2003. [Online]: http://dx.doi.org/10.1109/JSSC.2003.814432
[^137]: Y.Chae, K.Souri, and K.A.A. Makinwa, ''A 6.3 $\mu$w 20 bit incremental zoom-adc with 6 ppm inl and 1 $\mu$v offset,'' IEEE Journal of Solid-State Circuits, vol.48, no.12, pp. 3019--3027, Dec 2013. [Online]: http://dx.doi.org/10.1109/JSSC.2013.2278737
[^138]: Y.S. Shu, L.T. Kuo, and T.Y. Lo, ''An oversampling sar adc with dac mismatch error shaping achieving 105db sfdr and 101db sndr over 1khz bw in 55nm cmos,'' in IEEE Proceedings of the International Solid-State Circuits Conference, January 2016, pp. 458--459. [Online]: http://dx.doi.org/10.1109/ISSCC.2016.7418105
[^139]: P.Harpe, E.Cantatore, and A.van Roermund, ''An oversampled 12/14b sar adc with noise reduction and linearity enhancements achieving up to 79.1db sndr,'' in IEEE Proceedings of the International Solid-State Circuits Conference, February 2014, pp. 194--195. [Online]: http://dx.doi.org/10.1109/ISSCC.2014.6757396
[^140]: M.Braverman, J.Schneider, and C.Rojas, ''Space-bounded church-turing thesis and computational tractability of closed systems,'' Physical Review Letters, vol. 115, August 2015. [Online]: http://link.aps.org/doi/10.1103/PhysRevLett.115.098701
[^141]: M.Verhelst and A.Bahai, ''Where analog meets digital: Analog-to-information conversion and beyond,'' IEEE Solid-State Circuits Magazine, vol.7, no.3, pp. 67--80, September 2015. [Online]: http://dx.doi.org/10.1109/MSSC.2015.2442394
[^142]: A.H. Marblestone, B.M. Zamft, Y.G. Maguire, M.G. Shapiro, T.R. Cybulski, J.I. Glaser, D.Amodei, P.B. Stranges, R.Kalhor, D.A. Dalrymple, D.Seo, E.Alon, M.M. Maharbiz, J.M. Carmena, J.M. Rabaey, E.S. Boyden, G.M. Church, and K.P. Kording, ''Physical principles for scalable neural recording,'' Frontiers in Computational Neuroscience, vol.7, no. 137, 2013. [Online]: http://www.frontiersin.org/computational_neuroscience/10.3389/fncom.2013.00137
[^143]: L.Traver, C.Tarin, P.Marti, and N.Cardona, ''Adaptive-threshold neural spike detection by noise-envelope tracking,'' Electronics Letters, vol.43, no.24, pp. 1333--1335, November 2007. [Online]: http://dx.doi.org/10.1049/el:20071631
[^144]: I.Obeid and P.Wolf, ''Evaluation of spike-detection algorithms for a brain-machine interface application,'' IEEE Transactions on Biomedical Engineering, vol.51, no.6, pp. 905--911, June 2004. [Online]: http://dx.doi.org/10.1109/TBME.2004.826683
[^145]: P.Watkins, G.Santhanam, K.Shenoy, and R.Harrison, ''Validation of adaptive threshold spike detector for neural recording,'' in IEEE Proceedings of the International Conference on Engineering in Medicine and Biology Society, vol.2, September 2004, pp. 4079--4082. [Online]: http://dx.doi.org/10.1109/IEMBS.2004.1404138
[^146]: T.Takekawa, Y.Isomura, and T.Fukai, ''Accurate spike sorting for multi-unit recordings,'' European Journal of Neuroscience, vol.31, no.2, pp. 263--272, 2010. [Online]: http://dx.doi.org/10.1111/j.1460-9568.2009.07068.x
[^147]: A.Zviagintsev, Y.Perelman, and R.Ginosar, ''Low-power architectures for spike sorting,'' in IEEE Proceedings of the International Conference on Neural Engineering, March 2005, pp. 162--165. [Online]: http://dx.doi.org/10.1109/CNE.2005.1419579
[^148]: A.Rodriguez-Perez, J.Ruiz-Amaya, M.Delgado-Restituto, and A.Rodriguez-Vazquez, ''A low-power programmable neural spike detection channel with embedded calibration and data compression,'' IEEE Transactions on Biomedical Circuits and Systems, vol.6, no.2, pp. 87--100, April 2012. [Online]: http://dx.doi.org/10.1109/TBCAS.2012.2187352
[^149]: U.Rutishauser, E.M. Schuman, and A.N. Mamelak, ''Online detection and sorting of extracellularly recorded action potentials in human medial temporal lobe recordings, in vivo,'' Journal of Neuroscience Methods, vol. 154, no. 1--2, pp. 204--224, 2006. [Online]: http://www.sciencedirect.com/science/article/pii/S0165027006000033
[^150]: F.Franke, M.Natora, C.Boucsein, M.Munk, and K.Obermayer, ''An online spike detection and spike classification algorithm capable of instantaneous resolution of overlapping spikes,'' Journal of Computational Neuroscience, vol.29, no. 1--2, pp. 127--148, 2010. [Online]: http://dx.doi.org/10.1007/s10827-009-0163-5
[^151]: M.S. Chae, Z.Yang, M.Yuce, L.Hoang, and W.Liu, ''A 128-channel 6 mw wireless neural recording ic with spike feature extraction and uwb transmitter,'' IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol.17, no.4, pp. 312--321, August 2009. [Online]: http://dx.doi.org/10.1109/TNSRE.2009.2021607
[^152]: P.H. Thakur, H.Lu, S.S. Hsiao, and K.O. Johnson, ''Automated optimal detection and classification of neural action potentials in extra-cellular recordings,'' Journal of Neuroscience Methods, vol. 162, no. 1--2, pp. 364--376, 2007. [Online]: http://www.sciencedirect.com/science/article/pii/S0165027007000477
[^153]: J.Zhang, Y.Suo, S.Mitra, S.Chin, S.Hsiao, R.Yazicioglu, T.Tran, and R.Etienne-Cummings, ''An efficient and compact compressed sensing microsystem for implantable neural recordings,'' IEEE Transactions on Biomedical Circuits and Systems, vol.8, no.4, pp. 485--496, August 2014. [Online]: http://dx.doi.org/10.1109/TBCAS.2013.2284254
[^154]: Y.Suo, J.Zhang, T.Xiong, P.S. Chin, R.Etienne-Cummings, and T.D. Tran, ''Energy-efficient multi-mode compressed sensing system for implantable neural recordings,'' IEEE Transactions on Biomedical Circuits and Systems, vol.8, no.5, pp. 648--659, October 2014. [Online]: http://dx.doi.org/10.1109/TBCAS.2014.2359180
[^155]: B.Yu, T.Mak, X.Li, F.Xia, A.Yakovlev, Y.Sun, and C.S. Poon, ''Real-time fpga-based multichannel spike sorting using hebbian eigenfilters,'' IEEE Transactions on Emerging and Selected Topics in Circuits and Systems, vol.1, no.4, pp. 502--515, December 2011. [Online]: http://dx.doi.org/10.1109/JETCAS.2012.2183430
[^156]: V.Ventura, ''Automatic spike sorting using tuning information,'' Neural computation, vol.21, no.9, pp. 2466--2501, September 2009. [Online]: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4167425/
[^157]: D.Y. Barsakcioglu, A.Eftekhar, and T.G. Constandinou, ''Design optimisation of front-end neural interfaces for spike sorting systems,'' in IEEE Proceedings of the International Symposium on Circuits and Systems, May 2013, pp. 2501--2504. [Online]: http://dx.doi.org/10.1109/ISCAS.2013.6572387
[^158]: A.M. Sodagar, K.D. Wise, and K.Najafi, ''A fully integrated mixed-signal neural processor for implantable multichannel cortical recording,'' IEEE Transactions on Biomedical Engineering, vol.54, no.6, pp. 1075--1088, June 2007. [Online]: http://dx.doi.org/10.1109/TBME.2007.894986
[^159]: Y.Xin, W.X. Li, R.C. Cheung, R.H. Chan, H.Yan, D.Song, and T.W. Berger, ''An fpga based scalable architecture of a stochastic state point process filter (ssppf) to track the nonlinear dynamics underlying neural spiking,'' Microelectronics Journal, vol.45, no.6, pp. 690 -- 701, June 2014. [Online]: http://www.sciencedirect.com/science/article/pii/S0026269214000913
[^160]: C.Qian, J.Shi, J.Parramon, and E.Sánchez-Sinencio, ''A low-power configurable neural recording system for epileptic seizure detection,'' IEEE Transactions on Biomedical Circuits and Systems, vol.7, no.4, pp. 499--512, August 2013. [Online]: http://dx.doi.org/10.1109/TBCAS.2012.2228857
[^161]: K.C. Chun, P.Jain, J.H. Lee, and C.H. Kim, ''A 3t gain cell embedded dram utilizing preferential boosting for high density and low power on-die caches,'' IEEE Journal of Solid-State Circuits, vol.46, no.6, pp. 1495--1505, June 2011. [Online]: http://dx.doi.org/10.1109/JSSC.2011.2128150
[^162]: R.E. Matick and S.E. Schuster, ''Logic-based edram: Origins and rationale for use,'' IBM Journal of Research and Development, vol.49, no.1, pp. 145--165, January 2005. [Online]: http://dx.doi.org/10.1147/rd.491.0145
[^163]: R.Nair, ''Evolution of memory architecture,'' Proceedings of the IEEE, vol. 103, no.8, pp. 1331--1345, August 2015. [Online]: http://dx.doi.org/10.1109/JPROC.2015.2435018
[^164]: C.E. Molnar and I.W. Jones, ''Simple circuits that work for complicated reasons,'' in IEEE Proceedings of the International Symposium on Advanced Research in Asynchronous Circuits and Systems, 2000, pp. 138--149. [Online]: http://dx.doi.org/10.1109/ASYNC.2000.836995
[^165]: H.Schorr, ''Computer-aided digital system design and analysis using a register transfer language,'' IEEE Transactions on Electronic Computers, vol. EC-13, no.6, pp. 730--737, December 1964. [Online]: http://dx.doi.org/10.1109/PGEC.1964.263907
[^166]: D.Wang, A.Rajendiran, S.Ananthanarayanan, H.Patel, M.Tripunitara, and S.Garg, ''Reliable computing with ultra-reduced instruction set coprocessors,'' IEEE Micro, vol.34, no.6, pp. 86--94, November 2014. [Online]: http://dx.doi.org/10.1109/MM.2013.130
[^167]: ''Msp430g2x53 mixed signal microcontroller - data sheet,'' Texas Instruments Incorporated, Dallas, Texas, pp. 403--413, May 2013. [Online]: http://www.ti.com/lit/ds/symlink/msp430g2553.pdf
[^168]: F.L. Yuan, C.C. Wang, T.H. Yu, and D.Marković, ''A multi-granularity fpga with hierarchical interconnects for efficient and flexible mobile computing,'' IEEE Journal of Solid-State Circuits, vol.50, no.1, pp. 137--149, January 2015. [Online]: http://dx.doi.org/10.1109/JSSC.2014.2372034
[^169]: B.Vigraham, J.Kuppambatti, and P.R. Kinget, ''Switched-mode operational amplifiers and their application to continuous-time filters in nanoscale cmos,'' IEEE Journal of Solid-State Circuits, vol.49, no.12, pp. 2758--2772, December 2014. [Online]: http://dx.doi.org/10.1109/JSSC.2014.2354641
[^170]: Y.Tsividis, ''Event-driven data acquisition and continuous-time digital signal processing,'' in IEEE Proceedings of the Custom Integrated Circuits Conference, September 2010, pp. 1--8. [Online]: http://dx.doi.org/10.1109/CICC.2010.5617618
[^171]: I.Lee, D.Sylvester, and D.Blaauw, ''A constant energy-per-cycle ring oscillator over a wide frequency range for wireless sensor nodes,'' IEEE Journal of Solid-State Circuits, vol.51, no.3, pp. 697--711, March 2016. [Online]: http://dx.doi.org/10.1109/JSSC.2016.2517133
[^172]: B.Drost, M.Talegaonkar, and P.K. Hanumolu, ''Analog filter design using ring oscillator integrators,'' IEEE Journal of Solid-State Circuits, vol.47, no.12, pp. 3120--3129, December 2012. [Online]: http://dx.doi.org/10.1109/JSSC.2012.2225738
[^173]: V.Unnikrishnan and M.Vesterbacka, ''Time-mode analog-to-digital conversion using standard cells,'' IEEE Transactions on Circuits and Systems---Part I: Fundamental Theory and Applications, vol.61, no.12, pp. 3348--3357, December 2014. [Online]: http://dx.doi.org/10.1109/TCSI.2014.2340551
[^174]: K.Yang, D.Blaauw, and D.Sylvester, ''An all-digital edge racing true random number generator robust against pvt variations,'' IEEE Journal of Solid-State Circuits, vol.51, no.4, pp. 1022--1031, April 2016. [Online]: http://dx.doi.org/10.1109/JSSC.2016.2519383
[^175]: S.Chatterjee, Y.Tsividis, and P.Kinget, ''0.5-v analog circuit techniques and their application in ota and filter design,'' IEEE Journal of Solid-State Circuits, vol.40, no.12, pp. 2373--2387, December 2005. [Online]: http://dx.doi.org/10.1109/JSSC.2005.856280
[^176]: M.Alioto, ''Understanding dc behavior of subthreshold cmos logic through closed-form analysis,'' IEEE Transactions on Circuits and Systems---Part I: Fundamental Theory and Applications, vol.57, no.7, pp. 1597--1607, July 2010. [Online]: http://dx.doi.org/10.1109/TCSI.2009.2034233
[^177]: A.Hajimiri and T.Lee, ''A general theory of phase noise in electrical oscillators,'' IEEE Journal of Solid-State Circuits, vol.33, no.2, pp. 179--194, February 1998. [Online]: http://dx.doi.org/10.1109/4.658619
[^178]: A.Demir, A.Mehrotra, and J.Roychowdhury, ''Phase noise in oscillators: a unifying theory and numerical methods for characterization,'' IEEE Transactions on Circuits and Systems---Part I: Fundamental Theory and Applications, vol.47, no.5, pp. 655--674, May 2000. [Online]: http://dx.doi.org/10.1109/81.847872
[^179]: A.Hajimiri, S.Limotyrakis, and T.Lee, ''Phase noise in multi-gigahertz cmos ring oscillators,'' in IEEE Proceedings of the Custom Integrated Circuits Conference, May 1998, pp. 49--52. [Online]: http://dx.doi.org/10.1109/CICC.1998.694905
[^180]: W.Jiang, V.Hokhikyan, H.Chandrakumar, V.Karkare, and D.Markovic, ''A ±50mv linear-input-range vco-based neural-recording front-end with digital nonlinearity correction,'' in IEEE Proceedings of the International Solid-State Circuits Conference, January 2016, pp. 484--485. [Online]: http://dx.doi.org/10.1109/ISSCC.2016.7418118
[^181]: C.Weltin-Wu and Y.Tsividis, ''An event-driven clockless level-crossing adc with signal-dependent adaptive resolution,'' IEEE Journal of Solid-State Circuits, vol.48, no.9, pp. 2180--2190, September 2013. [Online]: http://dx.doi.org/10.1109/JSSC.2013.2262738
[^182]: H.Y. Yang and R.Sarpeshkar, ''A bio-inspired ultra-energy-efficient analog-to-digital converter for biomedical applications,'' IEEE Transactions on Circuits and Systems---Part I: Fundamental Theory and Applications, vol.53, no.11, pp. 2349--2356, November 2006. [Online]: http://dx.doi.org/10.1109/TCSI.2006.884463
[^183]: F.Corradi and G.Indiveri, ''A neuromorphic event-based neural recording system for smart brain-machine-interfaces,'' IEEE Transactions on Biomedical Circuits and Systems, vol.9, no.5, pp. 699--709, October 2015. [Online]: http://dx.doi.org/10.1109/TBCAS.2015.2479256
[^184]: K.A. Ng and Y.P. Xu, ''A compact, low input capacitance neural recording amplifier,'' IEEE Transactions on Biomedical Circuits and Systems, vol.7, no.5, pp. 610--620, October 2013. [Online]: http://dx.doi.org/10.1109/TBCAS.2013.2280066
[^185]: J.Agustin and M.Lopez-Vallejo, ''An in-depth analysis of ring oscillators: Exploiting their configurable duty-cycle,'' IEEE Transactions on Circuits and Systems---Part I: Fundamental Theory and Applications, vol.62, no.10, pp. 2485--2494, October 2015. [Online]: http://dx.doi.org/10.1109/TCSI.2015.2476300
[^186]: K.Ng and Y.P. Xu, ''A compact, low input capacitance neural recording amplifier,'' IEEE Transactions on Biomedical Circuits and Systems, vol.7, no.5, pp. 610--620, October 2013. [Online]: http://dx.doi.org/10.1109/TBCAS.2013.2280066
[^187]: M.Elia, L.B. Leene, and T.G. Constandinou, ''Continuous-time micropower interface for neural recording applications,'' in IEEE Proceedings of the International Symposium on Circuits and Systems, May 2016, pp. 534--537. [Online]: http://dx.doi.org/10.1109/ISCAS.2016.7527295
[^188]: Y.W. Li, K.L. Shepard, and Y.P. Tsividis, ''A continuous-time programmable digital fir filter,'' IEEE Journal of Solid-State Circuits, vol.41, no.11, pp. 2512--2520, November 2006. [Online]: http://dx.doi.org/10.1109/JSSC.2006.883314
[^189]: B.Schell and Y.Tsividis, ''A continuous-time adc/dsp/dac system with no clock and with activity-dependent power dissipation,'' IEEE Journal of Solid-State Circuits, vol.43, no.11, pp. 2472--2481, November 2008. [Online]: http://dx.doi.org/10.1109/JSSC.2008.2005456
[^190]: S.Aouini, K.Chuai, and G.W. Roberts, ''Anti-imaging time-mode filter design using a pll structure with transfer function dft,'' IEEE Transactions on Circuits and Systems---Part I: Fundamental Theory and Applications, vol.59, no.1, pp. 66--79, January 2012. [Online]: http://dx.doi.org/10.1109/TCSI.2011.2161411
[^191]: X.Xing and G.G.E. Gielen, ''A 42 fj/step-fom two-step vco-based delta-sigma adc in 40 nm cmos,'' IEEE Journal of Solid-State Circuits, vol.50, no.3, pp. 714--723, March 2015. [Online]: http://dx.doi.org/10.1109/JSSC.2015.2393814
[^192]: K.Reddy, S.Rao, R.Inti, B.Young, A.Elshazly, M.Talegaonkar, and P.K. Hanumolu, ''A 16-mw 78-db sndr 10-mhz bw ct $\Delta\Sigma$ adc using residue-cancelling vco-based quantizer,'' IEEE Journal of Solid-State Circuits, vol.47, no.12, pp. 2916--2927, December 2012. [Online]: http://dx.doi.org/10.1109/JSSC.2012.2218062
[^193]: J.Daniels, W.Dehaene, M.S.J. Steyaert, and A.Wiesbauer, ''A/d conversion using asynchronous delta-sigma modulation and time-to-digital conversion,'' IEEE Transactions on Circuits and Systems---Part I: Fundamental Theory and Applications, vol.57, no.9, pp. 2404--2412, September 2010. [Online]: http://dx.doi.org/10.1109/TCSI.2010.2043169
[^194]: F.M. Yaul and A.P. Chandrakasan, ''A sub-$\mu$w 36nv/$\sqrt{Hz}$ chopper amplifier for sensors using a noise-efficient inverter-based 0.2v-supply input stage,'' in IEEE Proceedings of the International Solid-State Circuits Conference, January 2016, pp. 94--95. [Online]: http://dx.doi.org/10.1109/ISSCC.2016.7417923
[^195]: S.Patil, A.Ratiu, D.Morche, and Y.Tsividis, ''A 3-10 fj/conv-step error-shaping alias-free continuous-time adc,'' IEEE Journal of Solid-State Circuits, vol.51, no.4, pp. 908--918, April 2016. [Online]: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7433385&isnumber=7446371
[^196]: J.M. Duarte-Carvajalino and G.Sapiro, ''Learning to sense sparse signals: Simultaneous sensing matrix and sparsifying dictionary optimization,'' IEEE Transactions on Image Processing, vol.18, no.7, pp. 1395--1408, July 2009. [Online]: http://dx.doi.org/10.1109/TIP.2009.2022459
[^197]: R.S. Schneider and H.C. Card, ''Analog hardware implementation issues in deterministic boltzmann machines,'' IEEE Transactions on Circuits and Systems---Part II: Analog and Digital Signal Processing, vol.45, no.3, pp. 352--360, Mar 1998. [Online]: http://dx.doi.org/10.1109/82.664241
[^198]: J.Lu, S.Young, I.Arel, and J.Holleman, ''A 1 tops/w analog deep machine-learning engine with floating-gate storage in 0.13$\mu$m cmos,'' IEEE Journal of Solid-State Circuits, vol.50, no.1, pp. 270--281, January 2015. [Online]: http://dx.doi.org/10.1109/JSSC.2014.2356197
[^199]: M.T. Wolf and J.W. Burdick, ''A bayesian clustering method for tracking neural signals over successive intervals,'' IEEE Transactions on Biomedical Engineering, vol.56, no.11, pp. 2649--2659, November 2009. [Online]: http://dx.doi.org/10.1109/TBME.2009.2027604
[^200]: D.Y. Barsakcioglu and T.G. Constandinou, ''A 32-channel mcu-based feature extraction and classification for scalable on-node spike sorting,'' in IEEE Proceedings of the International Symposium on Circuits and Systems, May 2016.
[^201]: R.P. Feynman, ''There's plenty of room at the bottom,'' Engineering and Science, vol.23, no.5, pp. 22--36, February 1960. [Online]: http://www.zyvex.com/nanotech/feynman.html
[^202]: G.Leuba and L.J. Garey, ''Comparison of neuronal and glial numerical density in primary and secondary visual cortex of man,'' Experimental Brain Research, vol.77, no.1, pp. 31--38, 1989. [Online]: http://dx.doi.org/10.1007/BF00250564
[^203]: ICNIRP, ''Guidelines for limiting exposure to time-varying electric, magnetic, and electromagnetic fields (up to 300 ghz),'' Health Physics, vol.74, no.4, pp. 494--522, April 1998. [Online]: http://www.icnirp.org/cms/upload/publications/ICNIRPemfgdl.pdf
[^204]: L.B. Leene, S.Luan, and T.G. Constandinou, ''A 890fj/bit uwb transmitter for soc integration in high bit-rate transcutaneous bio-implants,'' in IEEE Proceedings of the International Symposium on Circuits and Systems, May 2013, pp. 2271--2274. [Online]: http://dx.doi.org/10.1109/ISCAS.2013.6572330
[^205]: ''Unconventional processing of signals for intelligent data exploitation (upside),'' Defense Advanced Research Projects Agency, Arlington, Virginia, January 2016. [Online]: http://www.darpa.mil/program/unconventional-processing-of-signals-for-intelligent-data-exploitation