==================================
A guide to masked arrays in NumPy
==================================

.. Contents::

See http://www.scipy.org/scipy/numpy/wiki/MaskedArray (dead link)
for updates of this document.


History
-------

As a regular user of MaskedArray, I (Pierre G.F. Gerard-Marchant) became
increasingly frustrated with the subclassing of masked arrays (even if
I can only blame my inexperience). I needed to develop a class of arrays
that could store some additional information along with numerical values,
while keeping the possibility for missing data (picture storing a series
of dates along with measurements, what would later become the `TimeSeries
Scikit <http://projects.scipy.org/scipy/scikits/wiki/TimeSeries>`__
(dead link)).

I started to implement such a class, but then quickly realized that
any additional information disappeared when processing these subarrays
(for example, adding a constant value to a subarray would erase its
dates). I ended up writing the equivalent of *numpy.core.ma* for my
particular class, ufuncs included. Everything went fine until I needed to
subclass my new class, when more problems showed up: some attributes of
the new subclass were lost during processing. I identified the culprit as
MaskedArray, which returns masked ndarrays when I expected masked
arrays of my class. I was preparing myself to rewrite *numpy.core.ma*
when I forced myself to learn how to subclass ndarrays. As I became more
familiar with the *__new__* and *__array_finalize__* methods,
I started to wonder why masked arrays were objects, and not ndarrays,
and whether it wouldn't be more convenient for subclassing if they did
behave like regular ndarrays.

The new *maskedarray* is what I eventually came up with. The
main differences from the initial *numpy.core.ma* package are
that MaskedArray is now a subclass of *ndarray* and that the
*_data* section can now be any subclass of *ndarray*. Apart from a
couple of issues listed below, the behavior of the new MaskedArray
class reproduces the old one. Initially the *maskedarray*
implementation was marginally slower than *numpy.ma* in some areas,
but work is underway to speed it up; the expectation is that it can be
made substantially faster than the present *numpy.ma*.


Note that if the subclass has some special methods and
attributes, they are not propagated to the masked version:
this would require a modification of the *__getattribute__*
method (first trying *ndarray.__getattribute__*, then trying
*self._data.__getattribute__* if an exception is raised in the first
place), which really slows things down.

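
The point about subclass methods can be illustrated with a small sketch.
It assumes the *numpy.ma* module of a current NumPy release rather than the
standalone *maskedarray* package, and the *Tagged* class with its
*value_span* method is a hypothetical subclass made up for the illustration:

```python
import numpy as np
import numpy.ma as ma


class Tagged(np.ndarray):
    """Hypothetical ndarray subclass with an extra attribute and method."""

    def __new__(cls, data, tag=None):
        obj = np.asarray(data).view(cls)
        obj.tag = tag
        return obj

    def __array_finalize__(self, obj):
        # Called for views and for results of operations: carry the tag along.
        self.tag = getattr(obj, "tag", None)

    def value_span(self):
        # A "special method" that only the subclass defines.
        return float(self.max() - self.min())


base = Tagged([1.0, 2.0, 3.0], tag="station-A")
masked = ma.array(base, mask=[False, True, False])

# The data part of the masked array keeps the subclass ...
data_is_tagged = isinstance(masked.data, Tagged)
# ... but the subclass method is not exposed on the masked array itself.
method_on_masked = hasattr(masked, "value_span")
# It stays reachable through the data part.
span = masked.data.value_span()
```

Going through *masked.data* explicitly avoids the attribute-lookup fallback
mentioned above, at the price of ignoring the mask in the subclass method.
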

Main differences
----------------

 * The *_data* part of the masked array can be any subclass of ndarray (but not recarray, see below).
 * *fill_value* is now a property, not a function.
 * in the majority of cases, the mask is forced to *nomask* when no value is actually masked. A notable exception is when a masked array (with no masked values) has just been unpickled.
 * I got rid of the *share_mask* flag; I never understood its purpose.
 * *put*, *putmask* and *take* now mimic the ndarray methods, to avoid unpleasant surprises. Moreover, *put* and *putmask* both update the mask when needed.
 * if *a* is a masked array, *bool(a)* raises a *ValueError*, as it does with ndarrays.
 * in the same way, the comparison of two masked arrays is a masked array, not a boolean.
 * *filled(a)* returns an array of the same subclass as *a._data*, and no test is performed on whether it is contiguous or not.
 * the mask is always printed, even if it's *nomask*, which makes it easy (for me at least) to remember that a masked array is used.
 * *cumsum* works as if the *_data* array was filled with 0. The mask is preserved, but not updated.
 * *cumprod* works as if the *_data* array was filled with 1. The mask is preserved, but not updated.

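
Several of the points above can be checked interactively. The following is
a minimal sketch that assumes the *numpy.ma* module of a current NumPy
release stands in for the *maskedarray* package described here; the variable
names are only illustrative:

```python
import numpy as np
import numpy.ma as ma

a = ma.array([1, 2, 3, 4], mask=[False, True, False, False])
b = ma.array([1, 2, 0, 4])          # no mask given

# fill_value is a property: it is read and assigned, not called.
a.fill_value = -1
filled = a.filled()                 # masked entries replaced by fill_value

# With nothing masked, the mask is the nomask singleton.
b_has_nomask = ma.getmask(b) is ma.nomask

# bool() of a multi-element masked array is ambiguous, as for ndarrays.
try:
    bool(a)
    bool_raised = False
except ValueError:
    bool_raised = True

# Comparing two masked arrays gives a masked array, not a single boolean.
eq = (a == b)

# cumsum and cumprod treat masked entries as 0 and 1 and keep the input mask.
csum = a.cumsum()
cprod = a.cumprod()
```
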

New features
------------

This list is non-exhaustive...

 * the *mr_* function mimics *r_* for masked arrays.
 * the *anom* method returns the anomalies (deviations from the average).

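
Both features can be sketched in a few lines. The example assumes that
*mr_* and *anom* are available from the *numpy.ma* module of a current NumPy
release; the input values are made up so that the mean is a round number:

```python
import numpy.ma as ma

left = ma.array([1.0, 2.0], mask=[False, True])
right = ma.array([3.0, 8.0])

# mr_ concatenates like numpy.r_, but the masks are carried over.
joined = ma.mr_[left, right]

# anom() subtracts the mean of the unmasked values, here (1 + 3 + 8) / 3 = 4.
anomalies = joined.anom()
```
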

Using the new package with numpy.core.ma
----------------------------------------

I tried to make sure that the new package can understand old masked
arrays. Unfortunately, the reverse is not true: the old package does
not understand the new masked arrays.

For example:

>>> import numpy.core.ma as old_ma
>>> import maskedarray as new_ma
>>> x = old_ma.array([1,2,3,4,5], mask=[0,0,1,0,0])
>>> x
array(data =
 [     1      2 999999      4      5],
      mask =
 [False False True False False],
      fill_value=999999)
>>> y = new_ma.array([1,2,3,4,5], mask=[0,0,1,0,0])
>>> y
array(data = [1 2 -- 4 5],
      mask = [False False True False False],
      fill_value=999999)
>>> x==y
array(data =
 [True True True True True],
      mask =
 [False False True False False],
      fill_value=?)
>>> old_ma.getmask(x) == new_ma.getmask(x)
array([True, True, True, True, True])
>>> old_ma.getmask(y) == new_ma.getmask(y)
array([True, True, False, True, True])
>>> old_ma.getmask(y)
False


Using maskedarray with matplotlib
---------------------------------

Starting with matplotlib 0.91.2, the masked array importing will work with
the maskedarray branch as well as with earlier versions.

By default matplotlib still uses numpy.ma, but there is an rcParams setting
that you can use to select maskedarray instead.  In the matplotlibrc file
you will find::

  #maskedarray : False       # True to use external maskedarray module
                             # instead of numpy.ma; this is a temporary
                             # setting for testing maskedarray.


Uncomment and set to True to select maskedarray everywhere.
Alternatively, you can test a script with maskedarray by using a
command-line option, e.g.::

  python simple_plot.py --maskedarray


Masked records
--------------

Like *numpy.core.ma*, the *ndarray*-based implementation
of MaskedArray is limited when working with records: you can
mask any record of the array, but not a field in a record. If you
need this feature, you may want to give the *mrecords* package
a try (available in the *maskedarray* directory in the scipy
sandbox). This module defines a new class, *MaskedRecord*. An
instance of this class accepts a *recarray* as data, and uses two
masks. The *fieldmask* has as many entries as there are records in the
array; each entry has the same fields as a record, but of boolean type,
indicating whether each field is masked or not. The *mask* array flags
a record as masked if all of its fields are masked. A few examples in
the file should give you an idea of what can be done. Note that
*mrecords* is still experimental...

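
The idea of one mask per field plus one mask per record can be sketched as
follows. This example does not use the *mrecords* module described above:
it assumes the structured-dtype support of the *numpy.ma* module in a
current NumPy release, and the field names are made up for the illustration:

```python
import numpy.ma as ma

dtype = [("day", int), ("value", float)]
records = ma.array(
    [(1, 1.5), (2, 2.5), (3, 3.5)],
    dtype=dtype,
    # One boolean per field and per record.
    mask=[(False, False), (True, True), (False, True)],
)

# Mask of a single field across all records.
value_mask = ma.getmaskarray(records["value"])

# A record counts as masked only if all of its fields are masked.
record_mask = records.recordmask
```
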

Optimizing maskedarray
----------------------

Should masked arrays be filled before processing or not?
--------------------------------------------------------

In the current implementation, most operations on masked arrays involve
the following steps:

 * the input arrays are filled
 * the operation is performed on the filled arrays
 * the mask is set for the results, from the combination of the input masks and the mask corresponding to the domain of the operation.

For example, consider the division of two masked arrays::

  import numpy
  import maskedarray as ma
  x = ma.array([1,2,3,4], mask=[1,0,0,0], dtype=numpy.float64)
  y = ma.array([-1,0,1,2], mask=[0,0,0,1], dtype=numpy.float64)

The division of x by y is then computed as::

  d1 = x.filled(0) # d1 = array([0., 2., 3., 4.])
  d2 = y.filled(1) # d2 = array([-1.,  0.,  1.,  1.])
  m = ma.mask_or(ma.getmask(x), ma.getmask(y)) # m = array([True, False, False, True])
  dm = ma.divide.domain(d1, d2) # dm = array([False,  True, False, False])
  result = (d1/d2).view(ma.MaskedArray) # masked_array([-0., inf, 3., 4.])
  result._mask = numpy.logical_or(m, dm)

Note that a division by zero takes place. To avoid it, we can consider
filling the input arrays, taking the domain mask into account, so that::

  d1 = x._data.copy() # d1 = array([1., 2., 3., 4.])
  d2 = y._data.copy() # d2 = array([-1.,  0.,  1.,  2.])
  dm = ma.divide.domain(d1, d2) # dm = array([False,  True, False, False])
  numpy.putmask(d2, dm, 1) # d2 = array([-1.,  1.,  1.,  2.])
  m = ma.mask_or(ma.getmask(x), ma.getmask(y)) # m = array([True, False, False, True])
  result = (d1/d2).view(ma.MaskedArray) # masked_array([-1., 2., 3., 2.])
  result._mask = numpy.logical_or(m, dm)

Note that the *.copy()* is required to avoid updating the inputs with
*putmask*.  The *.filled()* method also involves a *.copy()*.

A third possibility consists in not filling the arrays at all::

  d1 = x._data # d1 = array([1., 2., 3., 4.])
  d2 = y._data # d2 = array([-1.,  0.,  1.,  2.])
  dm = ma.divide.domain(d1, d2) # dm = array([False,  True, False, False])
  m = ma.mask_or(ma.getmask(x), ma.getmask(y)) # m = array([True, False, False, True])
  result = (d1/d2).view(ma.MaskedArray) # masked_array([-1., inf, 3., 2.])
  result._mask = numpy.logical_or(m, dm)

Note that here again the division by zero takes place.

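
The prefilling and no-filling strategies can be compared in a
self-contained sketch. It assumes the *numpy.ma* module of a current NumPy
release instead of the *maskedarray* package, uses a plain test for a zero
denominator as a simplified stand-in for *ma.divide.domain*, and the two
helper functions are hypothetical names introduced for the illustration:

```python
import numpy as np
import numpy.ma as ma


def divide_prefilled(x, y):
    """Divide x by y, replacing unsafe denominators before dividing."""
    d1 = ma.getdata(x)
    d2 = ma.getdata(y).copy()          # copy: putmask works in place
    dm = (d2 == 0)                     # simplified stand-in for the domain test
    np.putmask(d2, dm, 1)
    mask = ma.getmaskarray(x) | ma.getmaskarray(y) | dm
    return ma.array(d1 / d2, mask=mask)


def divide_unfilled(x, y):
    """Divide the raw data and mask the unsafe entries afterwards."""
    d1, d2 = ma.getdata(x), ma.getdata(y)
    dm = (d2 == 0)
    with np.errstate(divide="ignore", invalid="ignore"):
        raw = d1 / d2                  # may contain inf or nan
    mask = ma.getmaskarray(x) | ma.getmaskarray(y) | dm
    return ma.array(raw, mask=mask)


x = ma.array([1, 2, 3, 4], mask=[1, 0, 0, 0], dtype=np.float64)
y = ma.array([-1, 0, 1, 2], mask=[0, 0, 0, 1], dtype=np.float64)

safe = divide_prefilled(x, y)
fast = divide_unfilled(x, y)
```

Both results carry the same mask; they differ only in the data hidden
behind it, where the unfilled variant leaves an *inf*.
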

A quick benchmark gives the following results:

 * *numpy.ma.divide*      : 2.69 ms per loop
 * classical division     : 2.21 ms per loop
 * division w/ prefilling : 2.34 ms per loop
 * division w/o filling   : 1.55 ms per loop

So, is it worth filling the arrays beforehand? Yes, if we are interested
in avoiding floating-point exceptions that may fill the result with infs
and nans. No, if we are only interested in speed...


Thanks
------

I'd like to thank Paul Dubois, Travis Oliphant and Sasha for the
original masked array package: without you, I would never have started
that (it might be argued that I shouldn't have anyway, but that's
another story...).  I also wish to extend these thanks to Reggie Dugard
and Eric Firing for their suggestions and numerous improvements.


Revision notes
--------------

  * 08/25/2007 : Creation of this page
  * 01/23/2007 : The package has been moved to the SciPy sandbox, and is regularly updated: please check out your SVN version!